home *** CD-ROM | disk | FTP | other *** search
Text File | 1994-09-01 | 114.9 KB | 2,877 lines |
- CHAPTER 8
-
- SORTING AND SEARCHING
-
-
- Two fundamental operations required of many applications are searching and
- sorting the data they operate on. Many different types of data are
- commonly sorted, such as customer names, payment due dates, or even a list
- of file names displayed in a file selection menu. If you are writing a
- programmer's cross reference utility, you may need to sort a list of
- variable names without regard to capitalization. In some cases, you may
- want to sort several pieces of related information based on the contents
- of only one of them. One example of that is a list of names and addresses
- sorted in ascending zip code order.
- Searching is equally important; for example, to locate a customer name
- in an array or disk file. In some cases you may wish to search for a
- complete match, while in others a partial match is needed. If you are
- searching a list of names for, say, Leonard, you probably would want to
- ignore Leonardo. But when searching a list of zip codes you may need to
- locate all that begin with the digits 068. There are many different ways
- sorting and searching can be accomplished, and the subject is by no means
- a simple one.
- Most programmers are familiar with the Bubble Sort, because it is the
- simplest to understand. Each adjacent pair of items is compared, and then
- exchanged if they are out of order. This process is repeated over and
- over, until the entire list has been examined as many times as there are
- items. Unfortunately, these repeated comparisons make the Bubble Sort an
- extremely poor performer. Similarly, code to perform a linear search that
- simply examines each item in succession for a match is easy to grasp, but
- it will be painfully slow when there are many items.
- In this chapter you will learn how sophisticated algorithms that handle
- these important programming chores operate. You will also learn how to
- sort data on more than one key. Often, it is not sufficient to merely sort
- a list of customers by their last name. For example, you may be expected
- to sort first by last name, then by first name, and finally by balance due.
- That is, all of the last names would first be sorted. Then within all of
- the Smiths you would sort again by first name, and for all of the John
- Smiths sort that subgroup based on how much money is owed.
- For completeness I will start each section by introducing sorting and
- searching methods that are easy to understand, and then progress to the
- more complex algorithms that are much more effective. Specifically, I will
- show the Quick Sort and Binary Search algorithms. When there are many
- thousands of data items, a good algorithm can make the difference between
- a sort routine that takes ten minutes to complete, and one that needs only
- a few seconds.
- Finally, I will discuss both BASIC and assembly language sort routines.
- As important as the right algorithm is for good performance, an assembly
- language implementation will be even faster. Chapter 12 describes how
- assembly language routines are written and how they work, and in this
- chapter I will merely show how to use the routines included with this book.
-
-
- SORTING FUNDAMENTALS
- ====================
-
- Although there are many different ways to sort an array, the simplest
- sorting algorithm is the Bubble Sort. The name Bubble is used because a
- FOR/NEXT loop repeatedly examines each adjacent pair of elements in the
- array, and those that have higher values rise to the top like bubbles in
- a bathtub. The most common type of sort is ascending, which means that "A"
- comes before "B", which comes before "C", and so forth. Figure 8-1 shows
- how the name Zorba ascends to the top of a five-item list of first names.
-
- Initial array contents:
-
- Element 4 Kathy
- Element 3 Barbara
- Element 2 Cathy
- Element 1 Zorba <
-
-
- After 1 pass:
-
- Element 4 Kathy
- Element 3 Barbara
- Element 2 Zorba <
- Element 1 Cathy
-
-
- After 2 passes:
-
- Element 4 Kathy
- Element 3 Zorba <
- Element 2 Barbara
- Element 1 Cathy
-
-
- After 3 passes:
-
- Element 4 Zorba <
- Element 3 Kathy
- Element 2 Barbara
- Element 1 Cathy
-
- Figure 8.1: Data ascending a list during a bubble sort.
-
- The Bubble Sort routine that follows uses a FOR/NEXT loop to repeatedly
- examine an array and exchange elements as necessary, until all of the items
- are in the correct order.
-
- DEFINT A-Z
- DECLARE SUB BubbleSort (Array$())
-
- CONST NumItems% = 20
- CONST False% = 0
- CONST True% = -1
-
- DIM Array$(1 TO NumItems%)
- FOR X = 1 TO NumItems%
- READ Array$(X)
- NEXT
-
- CALL BubbleSort(Array$())
-
- CLS
- FOR X = 1 TO NumItems%
- PRINT Array$(X)
- NEXT
-
- DATA Zorba, Cathy, Barbara, Kathy, Josephine
- DATA Joseph, Joe, Peter, Arnold, Glen
- DATA Ralph, Elli, Lucky, Rocky, Louis
- DATA Paula, Paul, Mary Lou, Marilyn, Keith
- END
-
- SUB BubbleSort (Array$()) STATIC
-
- DO
- OutOfOrder = False% 'assume it's sorted
- FOR X = 1 TO UBOUND(Array$) - 1
- IF Array$(X) > Array$(X + 1) THEN
- SWAP Array$(X), Array$(X + 1) 'if we had to swap
- OutOfOrder = True% 'we may not be done
- END IF
- NEXT
- LOOP WHILE OutOfOrder
-
- END SUB
-
- This routine is simple enough to be self-explanatory, and only a few things
- warrant discussing. One is the OutOfOrder flag variable. When the array
- is nearly sorted to begin with, fewer passes through the loop are needed.
- The OutOfOrder variable determines when no more passes are necessary. It
- is cleared at the start of each loop, and set each time two elements are
- exchanged. If, after examining all of the elements in one pass no
- exchanges were required, then the sorting is done and there's no need for
- the DO loop to continue.
- The other item worth mentioning is that the FOR/NEXT loop is set to
- consider one element less than the array actually holds. This is necessary
- because each element is compared to the one above it. If the last element
- were included in the loop, then BASIC would issue a "Subscript out of
- range" error on the statement that examines Array$(X + 1).
- There are a number of features you can add to this Bubble Sort routine.
- For example, you could sort without regard to capitalization. In that case
- "adams" would come before "BAKER", even though the lowercase letter "a" has
- a higher ASCII value than the uppercase letter "B". To add that capability
- simply use BASIC's UCASE$ (or LCASE$) function as part of the comparisons:
-
- IF UCASE$(Array$(X)) > UCASE$(Array$(X + 1)) THEN
-
- And to sort based on the eight-character portion that starts six bytes
- into each string you would use this:
-
- IF MID$(Array$(X), 5, 8) > MID$(Array$(X + 1), 5, 8) THEN
-
- Although the comparisons in this example are based on just a portion of
- each string, the SWAP statement must exchange the entire elements. This
- opens up many possibilities as you will see later in this chapter.
- If there is a chance that the strings may contain trailing blanks that
- should be ignored, you can use RTRIM$ on each pair of elements:
-
- IF RTRIM$(Array$(X)) > RTRIM$(Array$(X + 1)) THEN
-
- Of course, you can easily combine these enhancements to consider only the
- characters in the middle after they have been converted to upper or lower
- case.
- Sorting in reverse (descending) order is equally easy; you'd simply
- replace the greater-than symbol (>) with a less-than symbol (<).
- Finally, you can modify the routine to work with any type of data by
- changing the array type identifier. That is, for every occurrence of
- Array$ you will change that to Array% or Array# or whatever is appropriate.
- If you are sorting a numeric array, then different modifications may be in
- order. For example, to sort ignoring whether the numbers are positive or
- negative you would use BASIC's ABS (absolute value) function:
-
- IF ABS(Array!(X)) > ABS(Array!(X + 1)) THEN
-
- It is important to point out that all of the simple modifications described
- here can also be applied to the more sophisticated sort routines we will
- look at later in this chapter.
-
-
- INDEXED SORTS
-
- Besides the traditional sorting methods--whether a Bubble Sort or Quick
- Sort or any other type of sort--there is another category of sort routine
- you should be familiar with. Where a conventional sort exchanges elements
- in an array until they are in order, an Index Sort instead exchanges
- elements in a parallel numeric array of *pointers*. The original data is left
- intact, so it may still be accessed in its natural order. However, the
- array can also be accessed in sorted order by using the element numbers
- contained in the index array.
- As with a conventional sort, the comparisons in an indexed sort routine
- examine each element in the primary array, but based on the element numbers
- in that index array. If it is determined that the data is out of order,
- the routine exchanges the elements in the index array instead of the
- primary array. A modification to the Bubble Sort routine to sort using an
- index is shown below.
-
- DEFINT A-Z
- DECLARE SUB BubbleISort (Array$(), Index())
-
- CONST NumItems% = 20
- CONST False% = 0
- CONST True% = -1
-
- DIM Array$(1 TO NumItems%) 'this holds the string data
- DIM Ndx(1 TO NumItems%) 'this holds the index
-
- FOR X = 1 TO NumItems%
- READ Array$(X) 'read the string data
- Ndx(X) = X 'initialize the index array
- NEXT
-
- CALL BubbleISort(Array$(), Ndx())
-
- CLS
- FOR X = 1 TO NumItems%
- PRINT Array$(Ndx(X)) 'print based on the index
- NEXT
-
- DATA Zorba, Cathy, Barbara, Kathy, Josephine
- DATA Joseph, Joe, Peter, Arnold, Glen
- DATA Ralph, Elli, Lucky, Rocky, Louis
- DATA Paula, Paul, Mary lou, Marilyn, Keith
-
- SUB BubbleISort (Array$(), Index()) STATIC
-
- DO
- OutOfOrder = False% 'assume it's sorted
- FOR X = 1 TO UBOUND(Array$) - 1
- IF Array$(Index(X)) > Array$(Index(X + 1)) THEN
- SWAP Index(X), Index(X + 1) 'if we had to swap
- OutOfOrder% = True% 'we're not done yet
- END IF
- NEXT
- LOOP WHILE OutOfOrder%
-
- END SUB
-
- In this indexed sort, all references to the data are through the index
- array. And when a swap is necessary, it is the index array elements that
- are exchanged. Note that an indexed sort requires that the index array be
- initialized to increasing values--even if the sort routine is modified to
- be descending instead of ascending. Therefore, when BubbleISort is called
- Ndx(1) must hold the value 1, Ndx(2) is set to 2, and so forth.
- In this example the index array is initialized by the caller. However,
- it would be just as easy to put that code into the subprogram itself.
- Since you can't pass an array that hasn't yet been dimensioned, it makes
- the most sense to do both steps outside of the subprogram. Either way, the
- index array must be assigned to these initial values.
- As I mentioned earlier, one feature of an indexed sort is that it lets
- you access the data in both its original and sorted order. But there are
- other advantages, and a disadvantage as well. The disadvantage is that
- each comparison takes slightly longer, because of the additional overhead
- required to first look up the element number in the index array, to
- determine which elements in the primary array will be compared. In some
- cases, though, that can be more than offset by requiring less time to
- exchange elements.
- If you are sorting an array of 230-byte TYPE variables, the time needed
- for SWAP to exchange the elements can become considerable. Every byte in
- both elements must be read and written, so the time needed increases
- linearly as the array elements become longer. Contrast that with the fixed
- two bytes in the integer index array that are swapped.
- Another advantage of an indexed sort is that it lends itself to sorting
- more data than can fit in memory. As you will see later in the section
- that shows how to sort files, it is far easier to manipulate an integer
- index than an entire file. Further, sorting the file data using multiple
- passes requires twice as much disk space as the file already occupies.
-
-
- DATA MANIPULATION TECHNIQUES
-
- Before I show the Quick Sort algorithm that will be used as a basis for
- the remaining sort examples in this chapter, you should also be aware of
- a few simple tricks that can help you maintain and sort your data. One was
- described in Chapter 6, using a pair of functions that pack and unpack
- dates such that the year is stored before the month, which in turn is
- before the day. Thus, date strings are reduced to only three characters
- each, and they can be sorted directly.
- Another useful speed-up trick is to store string data as integers or
- long integers. If you had a system of four-digit account numbers you could
- use an integer instead of a string. Besides saving half the memory and
- disk space, the integer comparisons in a sort routine will be many times
- faster than a comparison on string equivalents. Zip codes are also suited
- to this, and could be stored in a long integer. Even though the space
- savings is only one byte, the time needed to compare the values for sorting
- will be greatly reduced.
- This brings up another important point. As you learned in Chapter 2,
- all conventional (not fixed-length) strings require more memory than might
- be immediately apparent. Besides the amount of memory needed to hold the
- data itself, four additional bytes are used for a string descriptor, and
- two more beyond those for a back pointer. Therefore, a zip code stored as
- a string will actually require eleven bytes rather than the five you might
- expect. With this in mind, you may be tempted to think that using a fixed-
- length string to hold the zip codes will solve the problem. Since fixed-
- length strings do not use either descriptors or back pointers, they do not
- need the memory they occupy. And that leads to yet another issue.
- Whenever a fixed-length string or the string portion of a TYPE variable
- is compared, it must first be converted to a regular descriptored string.
- BASIC has only one string comparison routine, and it expects the addresses
- for two conventional string descriptors. Every time a fixed-length string
- is used as an argument for comparison, BASIC must create a temporary copy,
- call its comparison routine, and then delete the copy. This copying adds
- code and wastes an enormous amount of time; in many cases the copying will
- take longer than the comparison itself. Therefore, using integers and long
- integers for numeric data where possible will provide more improvement than
- just the savings in memory use.
- In some cases, however, you must use fixed-length string or TYPE arrays.
- In particular, when sorting information from a random access disk file it
- is most sensible to load the records into a TYPE array. And as you learned
- in Chapter 2, the string components of a TYPE variable or array element are
- handled by BASIC as a fixed-length string. So how can you effectively sort
- fixed-length string arrays without incurring the penalty BASIC's overhead
- imposes? With assembly language subroutines, of course!
- Rather than ask BASIC to pass the data using its normal methods,
- assembly language routines can be invoked passing the data segments and
- addresses directly. When you use SEG, or a combination of VARSEG and
- VARPTR with fixed-length and TYPE variables, BASIC knows that you want the
- segmented address of the variable or array element. Thus, you are tricking
- BASIC into not making a copy as it usually would when passing such data.
- An assembly language subroutine or function can be designed to accept data
- addresses in any number of ways. As you will see later when we discuss
- sorting on multiple keys, extra trickery is needed to do the same thing in
- a BASIC procedure.
- The three short assembly language functions that follow compare two
- portions of memory, and then return a result that can be tested by your
- program.
-
- ;COMPARE.ASM - compares two ranges of memory
-
- .Model Medium, Basic
- .Code
-
- Compare Proc Uses DS ES DI SI, SegAdr1:DWord, _
- SegAdr2:DWord, NumBytes:Word
-
- Cld ;compare in the forward direction
- Mov SI,NumBytes ;get the address for NumBytes%
- Mov CX,[SI] ;put it into CX for comparing below
-
- Les DI,SegAdr1 ;load ES:DI with the first
- ; segmented address
- Lds SI,SegAdr2 ;load DS:SI with the second
- ; segmented address
-
- Repe Cmpsb ;do the compare
- Mov AX,0 ;assume the bytes didn't match
- Jne Exit ;we were right, skip over
- Dec AX ;wrong, decrement AX down to -1
-
- Exit:
- Ret ;return to BASIC
-
- Compare Endp
- End
-
- ;COMPARE2.ASM - compares memory case-insensitive
-
- .Model Medium, Basic
- .Code
-
- Compare2 Proc Uses DS ES DI SI, SegAdr1:DWord, _
- SegAdr2:DWord, NumBytes:Word
-
- Cld ;compare in the forward direction
- Mov BX,-1 ;assume the ranges are the same
-
- Mov SI,NumBytes ;get the address for NumBytes%
- Mov CX,[SI] ;put it into CX for comparing below
- Jcxz Exit ;if zero bytes were given, they're
- ; the same
- Les DI,SegAdr1 ;load ES:DI with the first address
- Lds SI,SegAdr2 ;load DS:SI with the second address
-
- Do:
- Lodsb ;load the current character from
- ; DS:SI into AL
- Call Upper ;capitalize as necessary
- Mov AH,AL ;copy the character to AH
-
- Mov AL,ES:[DI] ;load the other character into AL
- Inc DI ;point at the next one for later
- Call Upper ;capitalize as necessary
-
- Cmp AL,AH ;now, are they the same?
- Jne False ;no, exit now and show that
- Loop Do ;yes, continue
- Jmp Short Exit ;if we get this far, the bytes are
- ; all the same
- False:
- Inc BX ;increment BX to return zero
-
- Exit:
- Mov AX,BX ;assign the function output
- Ret ;return to BASIC
-
- Upper:
- Cmp AL,"a" ;is the character below an "a"?
- Jb Done ;yes, so we can skip it
- Cmp AL,"z" ;is the character above a "z"?
- Ja Done ;yes, so we can skip that too
- Sub AL,32 ;convert to upper case
-
- Done:
- Retn ;do a near return to the caller
-
- Compare2 Endp
- End
-
- ;COMPARE3.ASM - case-insensitive, greater/less than
-
- .Model Medium, Basic
- .Code
-
- Compare3 Proc Uses DS ES DI SI, SegAdr1:DWord, _
- SegAdr2:DWord, NumBytes:Word
-
- Cld ;compare in the forward direction
- Xor BX,BX ;assume the ranges are the same
-
- Mov SI,NumBytes ;get the address for NumBytes%
- Mov CX,[SI] ;put it into CX for comparing below
- Jcxz Exit ;if zero bytes were given, they're
- ; the same
- Les DI,SegAdr1 ;load ES:DI with the first address
- Lds SI,SegAdr2 ;load DS:SI with the second address
-
- Do:
- Lodsb ;load the current character from
- ; DS:SI into AL
- Call Upper ;capitalize as necessary, remove for
- ; case-sensitive
- Mov AH,AL ;copy the character to AH
-
- Mov AL,ES:[DI] ;load the other character into AL
- Inc DI ;point at the next character for later
- Call Upper ;capitalize as necessary, remove for
- ; case-sensitive
-
- Cmp AL,AH ;now, are they the same?
- Loope Do ;yes, continue
- Je Exit ;we exhausted the data and they're
- ; the same
- Mov BL,1 ;assume block 1 was "greater"
- Ja Exit ;we assumed correctly
- Dec BX ;wrong, bump BX down to -1
- Dec BX
-
- Exit:
- Mov AX,BX ;assign the function output
- Ret ;return to BASIC
-
- Upper:
- Cmp AL,"a" ;is the character below an "a"?
- Jb Done ;yes, so we can skip it
- Cmp AL,"z" ;is the character above a "z"?
- Ja Done ;yes, so we can skip that too
- Sub AL,32 ;convert to upper case
-
- Done:
- Retn ;do a near return to the caller
-
- Compare3 Endp
- End
-
- The first Compare routine above simply checks if all of the bytes are
- identical, and returns -1 (True) if they are, or 0 (False) if they are not.
- By returning -1 or 0 you can use either
-
- IF Compare%(Type1, Type2, NumBytes%) THEN
- or
- IF NOT Compare%(Type1, Type2, NumBytes%) THEN
-
- depending on which logic is clearer for your program. Compare2 is similar
- to Compare, except it ignores capitalization. That is, "SMITH" and Smith"
- are considered equal. The Compare3 function also compares memory and
- ignores capitalization, but it returns either -1, 0, or 1 to indicate if
- the first data range is less than, equal to, or greater than the second.
- The correct declaration and usage for each of these routines is shown
- below. Note that Compare and Compare2 are declared and used in the same
- fashion.
-
-
- Compare and Compare2:
-
- DECLARE FUNCTION Compare%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
- NumBytes%)
- Same = Compare%(Type1, Type2, NumBytes%)
-
- or
-
- DECLARE FUNCTION Compare%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
- BYVAL Adr2%, NumBytes%)
- Same = Compare%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)
-
-
- Here, Same receives -1 if the two TYPE variables or ranges of memory are
- the same, or 0 if they are not. NumBytes% tells how many bytes to compare.
-
-
- Compare3:
-
- DECLARE FUNCTION Compare3%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
- NumBytes%)
- Result = Compare3%(Type1, Type2, NumBytes%)
-
- or
- DECLARE FUNCTION Compare3%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
- BYVAL Adr2%, NumBytes%)
- Result = Compare3%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)
-
-
- Result receives 0 if the two type variables or ranges of memory are the
- same, -1 if the first is less when compared as strings, or 1 if the first
- is greater. NumBytes% tells how many bytes are to be to compared. In the
- context of a sort routine you could invoke Compare3 like this:
-
-
- IF Compare3%(TypeEl(X), TypeEl(X + 1), NumBytes%) = 1 THEN
- SWAP TypeEl(X), TypeEl(X + 1)
- END IF
-
-
- As you can see, these routines may be declared in either of two ways.
- When used with TYPE arrays the first is more appropriate and results in
- slightly less setup code being generated by the compiler. When comparing
- fixed-length strings or arbitrary blocks of memory (for example, when one
- of the ranges is on the display screen) you should use the second method.
- Since SEG does not work correctly with fixed-length strings, if you want
- to use that more efficient version you must create a dummy TYPE comprised
- solely of a single string portion:
-
-
- TYPE FixedLength
- Something AS STRING * 35
- END TYPE
-
-
- Then simply use DIM to create a single variable or an array based on this
- or a similar TYPE, depending on what your program needs. The requirement
- to create a dummy TYPE was discussed in Chapter 2, and I won't belabor the
- reasons again here. These comparison routines will be used extensively in
- the sort routines presented later in this chapter; however, their value in
- other, non-sorting situations should also be apparent.
- Although these routines are written in assembly language, they are
- fairly simple to follow. It is important to understand that you do not
- need to know anything about assembly language to use them. All of the
- files you need to add these and all of the other routines in this book are
- contained on the accompanying diskette [here, in the same ZIP file as this
- text]. Chapter 12 discusses assembly language in great detail, and you can
- refer there for further explanation of the instructions used.
- If you plan to run the programs that follow in the QuickBASIC editor,
- you must load the BASIC.QLB Quick Library as follows:
-
- qb program /l basic
-
- Later when you compile these or other programs you must link with the
- parallel BASIC.LIB file:
-
- bc program [/o];
- link program , , nul , basic;
-
- If you are using BASIC PDS start QBX using the BASIC7.QLB file, and then
- link with BASIC7.LIB to produce a stand-alone .EXE program. [VB/DOS users
- will also use the BASIC7 version.]
-
-
- THE QUICK SORT ALGORITHM
- ========================
-
- It should be obvious to you by now that a routine written in assembly
- language will always be faster than an equivalent written in BASIC.
- However, simply translating a procedure to assembly language is not always
- the best solution. Far more important than which language you use is
- selecting an appropriate algorithm. The best sorting method I know is the
- Quick Sort, and a well-written version of Quick Sort using BASIC will be
- many times faster than an assembly language implementation of the Bubble
- Sort.
- The main problem with the Bubble Sort is that the number of comparisons
- required grows exponentially as the number of elements increases. Since
- each pass through the array exchanges only a few elements, many passes are
- required before the entire array is sorted. The Quick Sort was developed
- by C.A.R. (Tony) Hoare, and is widely recognized as the fastest algorithm
- available. In some special cases, such as when the data is already sorted
- or nearly sorted, the Quick Sort may be slightly slower than other methods.
- But in most situations, a Quick Sort is many times faster than any other
- sorting algorithm.
- As with the Bubble Sort, there are many different variations on how a
- Quick Sort may be coded. (You may have noticed that the Bubble Sort shown
- in Chapter 7 used a nested FOR/NEXT loop, while the one shown here uses a
- FOR/NEXT loop within a DO/WHILE loop.) A Quick Sort divides the array into
- sections--sometimes called partitions--and then sorts each section
- individually. Many implementations therefore use recursion to invoke the
- subprogram from within itself, as each new section is about to be sorted.
- However, recursive procedures in any language are notoriously slow, and
- also consume stack memory at an alarming rate.
- The Quick Sort version presented here avoids recursion, and instead uses
- a local array as a form of stack. This array stores the upper and lower
- bounds showing which section of the array is currently being considered.
- Another refinement I have added is to avoid making a copy of elements in
- the array. As a Quick Sort progresses, it examines one element selected
- arbitrarily from the middle of the array, and compares it to the elements
- that lie above and below it. To avoid assigning a temporary copy this
- version simply keeps track of the selected element number.
- When sorting numeric data, maintaining a copy of the element is
- reasonable. But when sorting strings--especially strings whose length is
- not known ahead of time--the time and memory required to keep a copy can
- become problematic. For clarity, the generic Quick Sort shown below uses
- the copy method. Although this version is meant for sorting a single
- precision array, it can easily be adapted to sort any type of data by
- simply changing all instances of the "!" type declaration character.
-
- '******** QSORT.BAS, Quick Sort algorithm demonstration
-
- 'Copyright (c) 1991 Ethan Winer
-
- DEFINT A-Z
- DECLARE SUB QSort (Array!(), StartEl, NumEls)
-
- RANDOMIZE TIMER 'generate a new series each run
-
- DIM Array!(1 TO 21) 'create an array
- FOR X = 1 TO 21 'fill with random numbers
- Array!(X) = RND(1) * 500 'between 0 and 500
- NEXT
-
- FirstEl = 6 'sort starting here
- NumEls = 10 'sort this many elements
-
- CLS
- PRINT "Before Sorting:"; TAB(31); "After sorting:"
- PRINT "==============="; TAB(31); "=============="
-
- FOR X = 1 TO 21 'show them before sorting
- IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
- PRINT "==>";
- END IF
- PRINT TAB(5); USING "###.##"; Array!(X)
- NEXT
-
- CALL QSort(Array!(), FirstEl, NumEls)
-
- LOCATE 3
- FOR X = 1 TO 21 'print them after sorting
- LOCATE , 30
- IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
- PRINT "==>"; 'point to sorted items
- END IF
- LOCATE , 35
- PRINT USING "###.##"; Array!(X)
- NEXT
-
- SUB QSort (Array!(), StartEl, NumEls) STATIC
-
- REDIM QStack(NumEls \ 5 + 10) 'create a stack array
-
- First = StartEl 'initialize work variables
- Last = StartEl + NumEls - 1
-
- DO
- DO
- Temp! = Array!((Last + First) \ 2) 'seek midpoint
- I = First
- J = Last
-
- DO 'reverse both < and > below to sort descending
- WHILE Array!(I) < Temp!
- I = I + 1
- WEND
- WHILE Array!(J) > Temp!
- J = J - 1
- WEND
- IF I > J THEN EXIT DO
- IF I < J THEN SWAP Array!(I), Array!(J)
- I = I + 1
- J = J - 1
- LOOP WHILE I <= J
-
- IF I < Last THEN
- QStack(StackPtr) = I 'Push I
- QStack(StackPtr + 1) = Last 'Push Last
- StackPtr = StackPtr + 2
- END IF
-
- Last = J
- LOOP WHILE First < Last
-
- IF StackPtr = 0 THEN EXIT DO 'Done
- StackPtr = StackPtr - 2
- First = QStack(StackPtr) 'Pop First
- Last = QStack(StackPtr + 1) 'Pop Last
- LOOP
-
- ERASE QStack 'delete the stack array
-
- END SUB
-
- Notice that I have designed this routine to allow sorting only a portion
- of the array. To sort the entire array you'd simply omit the StartEl and
- NumEls parameters, and assign First and Last from the LBOUND and UBOUND
- element numbers. That is, you will change these:
-
- First = StartEl
- and
- Last = StartEl + NumEls - 1
-
- to these:
-
- First = LBOUND(Array!)
- and
- Last = UBOUND(Array!)
-
- As I mentioned earlier, the QStack array serves as a table of element
- numbers that reflect which range of elements is currently being considered.
- You will need to dimension this array to one element for every five
- elements in the primary array being sorted, plus a few extra for good
- measure. In this program I added ten elements, because one stack element
- for every five main array elements is not enough for very small arrays.
- For data arrays that have a large amount of duplicated items, you will
- probably need to increase the size of the stack array.
- Note that this ratio is not an absolute--the exact size of the stack
- that is needed depends on how badly out of order the data is to begin with.
- Although it is possible that one stack element for every five in the main
- array is insufficient in a given situation, I have never seen this formula
- fail. Because the stack is a dynamic integer array that is stored in far
- memory, it will not impinge on near string memory. If this routine were
- designed using the normal recursive method, BASIC's stack would be used
- which is in near memory.
- Each of the innermost DO loops searches the array for the first element
- in each section about the midpoint that belongs in the other section. If
- the elements are indeed out of order (when I is less than J) the elements
- are exchanged. This incrementing and comparing continues until I and J
- cross. At that point, assuming the variable I has not exceeded the upper
- limits of the current partition, the partition bounds are saved and Last
- is assigned to the top of the next inner partition level. When the entire
- partition has been processed, the previous bounds are retrieved, but as a
- new set of First and Last values. This process continues until no more
- partition boundaries are on the stack. At that point the entire array is
- sorted.
- On the accompanying disk you will find a program called SEEQSORT.BAS
- that contains an enhanced version of the QSort demo and subprogram. This
- program lets you watch the progress of the comparisons and exchanges as
- they are made, and actually see this complex algorithm operate. Simply
- load SEEQSORT.BAS into the BASIC editor and run it. A constant named
- Delay! is defined at the beginning of the program. Increasing its value
- makes the program run more slowly; decreasing it causes the program to run
- faster.
-
-
- AN ASSEMBLY LANGUAGE QUICK SORT
-
- As fast as the BASIC QuickSort routine is, we can make it even faster.
- The listing below shows an assembly language version that is between ten
- and twenty percent faster, depending on which compiler you are using and
- if the BASIC PDS /fs (far strings) option is in effect.
-
- ;SORT.ASM - sorts an entire BASIC string array
-
- .Model Medium, Basic
- .Data
- S DW 0
- F DW 0
- L DW 0
- I DW 0
- J DW 0
- MidPoint DW 0
-
- .Code
- Extrn B$SWSD:Proc ;this swaps two strings
- Extrn B$SCMP:Proc ;this compares two strings
-
- Sort Proc Uses SI DI ES, Array:Word, Dir:Word
-
- Cld ;all fills and compares are forward
- Push DS ;set ES = DS for string compares
- Pop ES
-
- Xor CX,CX ;clear CX
- Mov AX,7376h ;load AL and AH with the opcodes
- ; Jae and Jbe in preparation for
- ; code self-modification
- Mov BX,Dir ;get the sorting direction
- Cmp [BX],CX ;is it zero (ascending sort)?
- Je Ascending ;yes, skip ahead
- Xchg AL,AH ;no exchange the opcodes
-
- Ascending:
- Mov CS:[X1],AH ;install correct comparison opcodes
- Mov CS:[X2],AL ; based on the sort direction
-
- Mov BX,Array ;load the array descriptor address
- Mov AX,[BX+0Eh] ;save the number of elements
- Dec AX ;adjust the number to zero-based
- Jns L0 ;at least 1 element, continue
- Jmp L4 ;0 or less elements, get out now!
-
- L0:
- Mov BX,Array ;reload array descriptor address
- Mov BX,[BX] ;Array$(LBOUND) descriptor address
- Mov S,SP ;StackPtr = 0 (normalized to SP)
- Mov F,CX ;F = 0
- Mov L,AX ;L = Size%
-
- ;----- calculate the value of MidPoint
- L1:
- Mov DI,L ;MidPoint = (L + F) \ 2
- Add DI,F
- Shr DI,1
- Mov MidPoint,DI
-
- Mov AX,F ;I = F
- Mov I,AX
-
- Mov AX,L ;J = L
- Mov J,AX
-
- ;----- calculate the offset into the descriptor table for Array$(MidPoint)
- L1_2:
-
- Shl DI,1 ;multiply MidPoint in DI times 4
- Shl DI,1 ;now DI holds how far beyond Array$(Start)
- ; Array$(MidPoint)'s descriptor is
- Add DI,BX ;add the array base address to produce the final
- ; address for Array$(MidPoint)
-
- ;----- calculate descriptor offset for Array$(I)
- L2:
- Mov SI,I ;put I into SI
- Shl SI,1 ;as above
- Shl SI,1 ;now SI holds how far beyond Array$(Start)
- ; Array$(I)'s descriptor is
- Add SI,BX ;add the base to produce the final descriptor
- ; address
-
- ;IF Array$(I) < Array$(MidPoint) THEN I = I + 1: GOTO L2
- Push BX ;save BX because B$SCMP trashes it
- Push SI
- Push DI
- Call B$SCMP ;do the compare
- Pop BX ;restore BX
-
- X1 Label Byte ;modify the code below to "Jbe" if descending sort
- Jae L2_1 ;Array$(I) isn't less, continue on
-
- Inc Word Ptr I ;I = I + 1
- Jmp Short L2 ;GOTO L2
-
- ;----- calculate descriptor offset for Array$(J)
- L2_1:
- Mov SI,J ;put J into SI
- Shl SI,1 ;as above
- Shl SI,1 ;now SI holds how far beyond Array$(Start)
- ; Array$(J)'s descriptor is
- Add SI,BX ;add the base to produce the final descriptor
- ; address
-
- ;IF Array$(J) > Array$(MidPoint) THEN J = J - 1: GOTO L2.1
- Push BX ;preserve BX
- Push SI
- Push DI
- Call B$SCMP ;do the compare
- Pop BX ;restore BX
-
- X2 Label Byte ;modify the code below to "Jae" if descending sort
- Jbe L2_2 ;Array$(J) isn't greater, continue on
-
- Dec Word Ptr J ;J = J - 1
- Jmp Short L2_1 ;GOTO L2.1
-
- L2_2:
- Mov AX,I ;IF I > J GOTO L3
- Cmp AX,J
- Jg L3 ;J is greater, go directly to L3
- Je L2_3 ;they're the same, skip the swap
-
- ;Swap Array$(I), Array$(J)
- Mov SI,I ;put I into SI
- Mov DI,J ;put J into DI
-
- Cmp SI,MidPoint ;IF I = MidPoint THEN MidPoint = J
- Jne No_Mid1 ;not equal, skip ahead
- Mov MidPoint,DI ;equal, assign MidPoint = J
- Jmp Short No_Mid2 ;don't waste time comparing again
-
- No_Mid1:
- Cmp DI,MidPoint ;IF J = MidPoint THEN MidPoint = I
- Jne No_Mid2 ;not equal, skip ahead
- Mov MidPoint,SI ;equal, assign MidPoint = I
-
- No_Mid2:
- Mov SI,I ;put I into SI
- Shl SI,1 ;multiply times four for the
- Shl SI,1 ; for the descriptors
- Add SI,BX ;add address for first descriptor
-
- Mov DI,J ;do the same for J in DI
- Shl DI,1
- Shl DI,1
- Add DI,BX
-
- Push BX ;save BX because B$SWSD destroys it
- Call B$SWSD ;and swap 'em good
- Pop BX
-
- L2_3:
- Inc Word Ptr I ;I = I + 1
- Dec Word Ptr J ;J = J - 1
-
- Mov AX,I ;IF I <= J GOTO L2
- Cmp AX,J
- Jg L3 ;it's greater, skip to L3
- Mov DI,MidPoint ;get MidPoint again
- Jmp L1_2 ;go back to just before L2
-
- L3:
- Mov AX,I ;IF I < L THEN PUSH I: PUSH L
- Cmp AX,L
- Jnl L3_1 ;it's not less, so skip Pushes
-
- Push I ;Push I
- Push L ;Push L
-
- L3_1:
- Mov AX,J ;L = J
- Mov L,AX
-
- Mov AX,F ;IF F < L GOTO L1
- Cmp AX,L
- Jnl L3_2 ;it's not less, jump ahead to L3_2
- Jmp L1 ;it's less, go to L1
-
- L3_2:
- Cmp S,SP ;IF S = 0 GOTO L4
- Je L4
-
- Pop L ;Pop L
- Pop F ;Pop F
- Jmp L1 ;GOTO L1
-
- L4:
- Ret ;return to BASIC
-
- Sort Endp
- End
-
- Besides being faster than the BASIC version, the assembly language Sort
- routine is half the size. This version also supports sorting either
- forward or backward, but not just a portion of an array. The general
- syntax is:
-
- CALL Sort(Array$(), Direction)
-
- Where Array$() is any variable-length string array, and Direction is 0 for
- ascending, or any other value for descending. Note that this routine calls
- upon BASIC's internal services to perform the actual comparing and
- swapping; therefore, the exact same code can be used with either QuickBASIC
- or BASIC PDS. Again, I refer you forward to Chapter 12 for an explanation
- of the assembly language commands used in SORT.ASM.
-
-
- SORTING ON MULTIPLE KEYS
-
- In many situations, sorting based on one key is sufficient. For example,
- if you are sorting a mailing list to take advantage of bulk rates you must
- sort all of the addresses in order by zip code. When considering complex
- data such as a TYPE variable, it is easy to sort the array based on one
- component of each element. The earlier Bubble Sort example showed how MID$
- could be used to consider just a portion of each string, even though the
- entire elements were exchanged. Had that routine been designed to operate
- on a TYPE array, the comparisons would have examined just one component,
- but the SWAP statements would exchange entire elements:
-
- IF Array(X).ZipCode > Array(X + 1).ZipCode THEN
- SWAP Array(X), Array(X + 1)
- END IF
-
- This way, each customer's last name, first name, street address, and so
- forth remain connected to the zip codes that are being compared and
- exchanged.
- There are several ways to sort on more than one key, and all are of
- necessity more complex than simply sorting based on a single key. One
- example of a multi-key sort first puts all of the last names in order.
- Then within each group of identical last names the first names are sorted,
- and within each group of identical last and first names further sorting is
- performed on yet another key--perhaps Balance Due. As you can see, this
- requires you to sort based on differing types of data, and also to compare
- ranges of elements for the subgroups that need further sorting.
- The biggest complication with this method is designing a calling syntax
- that lets you specify all of the information. A table array must be
- established to hold the number of keys, the type of data in each key
- (string, double precision, and so forth), and how many bytes into the TYPE
- element each key portion begins. Worse, you can't simply use the name of
- a TYPE component in the comparisons inside the sort routine--which would
- you use: Array(X).LastName, Array(X).FirstName, or Array(X).ZipCode?
- Therefore, a truly general multi-key sort must be called passing the
- address where the array begins in memory, and a table of offsets beyond
- that address where each component being considered is located.
- To avoid this added complexity I will instead show a different method
- that has only a few minor restrictions, but is much easier to design and
- understand. This method requires you to position each TYPE component into
- the key order you will sort on. You will also need to store all numbers
- that will be used for a sort key as ASCII digits. To sort first on last
- name, then first name, and then on balance due, the TYPE might be
- structured as follows:
-
-
- TYPE Customer
- LastName AS STRING * 15
- FirstName AS STRING * 15
- BalanceDue AS STRING * 9
- Street AS STRING * 32
- City AS STRING * 15
- State AS STRING * 2
- ZipCode AS STRING * 5
- AnyNumber AS DOUBLE
- END TYPE
-
-
- In most cases the order in which each TYPE member is placed has no
- consequence. When you refer to TypeVar.LastName, BASIC doesn't care if
- LastName is defined before or after FirstName in the TYPE structure.
- Either way it translates your reference to LastName into an address.
- Having to store numeric data as strings is a limitation, but this is needed
- only for those TYPE fields that will be used as a sort key.
- The key to sorting on multiple items simultaneously is by treating the
- contiguous fields as a single long field. Since assignments to the string
- portion of a TYPE variable are handled internally by BASIC's LSET routine,
- the data in each element will be aligned such that subsequent fields can
- be treated as an extension of the primary field. Figure 8-2 below shows
- five TYPE array elements in succession, as they would be viewed by a string
- comparison routine. This data is defined as a subset of the name and
- address TYPE shown above, using just the first three fields. Notice that
- the balance due fields must be right-aligned (using RSET) for the numeric
- values to be considered correctly.
-
-
- Type.LastName Type.FirstName Type.BalanceDue
- ===============---------------=========
- Munro Jay 8000.00
- Smith John 122.03
- Johnson Alfred 14537.89
- Rasmussen Peter 100.90
- Hudson Cindy 21.22
- ^ ^ ^
- Field 1 Field 2 Field 3
- starts here starts here starts here
-
- Figure 8-2: Multiple contiguous fields in a TYPE can be treated as a single
- long field.
-
-
- Thus, the sort routine would be told to start at the first field, and
- consider the strings to be 15 + 15 + 9 = 39 characters long. This way all
- three fields are compared at one time, and treated as a single entity.
- Additional fields can of course follow these, and they may be included in
- the comparison or not at your option.
- The combination demonstration and subroutine below sorts such a TYPE
- array on any number of keys using this method, and it has a few additional
- features as well. Besides letting you confine the sorting to just a
- portion of the array, you may also specify how far into each element the
- first key is located. As long as the key fields are contiguous, they do
- not have to begin at the start of each TYPE. Therefore, you could sort
- just on the first name field, or on any other field or group of fields.
-
- 'TYPESORT.BAS - performs a multi-key sort on TYPE arrays
-
- 'Copyright (c) 1991 Ethan Winer
-
- DEFINT A-Z
- DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
- BYVAL Adr2, NumBytes)
- DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
- BYVAL Length)
- DECLARE SUB TypeSort (Segment, Address, ElSize, Offset, KeySize, NumEls)
-
- CONST NumEls% = 23 'this keeps it all on the screen
-
- TYPE MyType
- LastName AS STRING * 10
- FirstName AS STRING * 10
- Dollars AS STRING * 6
- Cents AS STRING * 2
- END TYPE
- REDIM Array(1 TO NumEls%) AS MyType
-
- '---- Disable (REM out) all but one of the following blocks to test
-
- Offset = 27 'start sorting with Cents
- ElSize = LEN(Array(1)) 'the length of each element
- KeySize = 2 'sort on the Cents only
-
- Offset = 21 'start sorting with Dollars
- ElSize = LEN(Array(1)) 'the length of each element
- KeySize = 8 'sort Dollars and Cents only
-
- Offset = 11 'start sorting with FirstName
- ElSize = LEN(Array(1)) 'the length of each element
- KeySize = 18 'sort FirstName through Cents
-
- Offset = 1 'start sorting with LastName
- ElSize = LEN(Array(1)) 'the length of each element
- KeySize = ElSize 'sort based on all 4 fields
-
- FOR X = 1 TO NumEls% 'build the array from DATA
- READ Array(X).LastName
- READ Array(X).FirstName
- READ Amount$ 'format the amount into money
- Dot = INSTR(Amount$, ".")
- IF Dot THEN
- RSET Array(X).Dollars = LEFT$(Amount$, Dot - 1)
- Array(X).Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
- ELSE
- RSET Array(X).Dollars = Amount$
- Array(X).Cents = "00"
- END IF
- NEXT
-
- Segment = VARSEG(Array(1)) 'show where the array is
- Address = VARPTR(Array(1)) ' located in memory
- CALL TypeSort(Segment, Address, ElSize, Offset, KeySize, NumEls%)
-
- CLS 'display the results
- FOR X = 1 TO NumEls%
- PRINT Array(X).LastName, Array(X).FirstName,
- PRINT Array(X).Dollars; "."; Array(X).Cents
- NEXT
-
- DATA Smith, John, 123.45
- DATA Cramer, Phil, 11.51
- DATA Hogan, Edward, 296.08
- DATA Cramer, Phil, 112.01
- DATA Malin, Donald, 13.45
- DATA Cramer, Phil, 111.3
- DATA Smith, Ralph, 123.22
- DATA Smith, John, 112.01
- DATA Hogan, Edward, 8999.04
- DATA Hogan, Edward, 8999.05
- DATA Smith, Bob, 123.45
- DATA Cramer, Phil, 11.50
- DATA Hogan, Edward, 296.88
- DATA Malin, Donald, 13.01
- DATA Cramer, Phil, 111.1
- DATA Smith, Ralph, 123.07
- DATA Smith, John, 112.01
- DATA Hogan, Edward, 8999.33
- DATA Hogan, Edward, 8999.17
- DATA Hogan, Edward, 8999.24
- DATA Smith, John, 123.05
- DATA Cramer, David, 1908.80
- DATA Cramer, Phil, 112
- END
-
- SUB TypeSort (Segment, Address, ElSize, Displace, KeySize, NumEls) STATIC
-
- REDIM QStack(NumEls \ 5 + 10) 'create a stack array
-
- First = 1 'initialize working variables
- Last = NumEls
- Offset = Displace - 1 'decrement once now rather than
- ' repeatedly later
- DO
- DO
- Temp = (Last + First) \ 2 'seek midpoint
- I = First
- J = Last
-
- DO
- WHILE Compare3%(Segment, Address + Offset + (I - 1) * ElSize, Segment, _
- Address + Offset + (Temp-1) * ElSize, KeySize) = -1 '< 1 for descending
- I = I + 1
- WEND
- WHILE Compare3%(Segment, Address + Offset + (J - 1) * ElSize, Segment, _
- Address + Offset + (Temp-1) * ElSize, KeySize) = 1 '< -1 for descending
- J = J - 1
- WEND
- IF I > J THEN EXIT DO
- IF I < J THEN
- CALL SwapMem(Segment, Address + (I - 1) * ElSize, Segment, _
- Address + (J - 1) * ElSize, ElSize)
- IF Temp = I THEN
- Temp = J
- ELSEIF Temp = J THEN
- Temp = I
- END IF
- END IF
- I = I + 1
- J = J - 1
- LOOP WHILE I <= J
-
- IF I < Last THEN
- QStack(StackPtr) = I 'Push I
- QStack(StackPtr + 1) = Last 'Push Last
- StackPtr = StackPtr + 2
- END IF
-
- Last = J
- LOOP WHILE First < Last
-
- IF StackPtr = 0 THEN EXIT DO 'Done
- StackPtr = StackPtr - 2
- First = QStack(StackPtr) 'Pop First
- Last = QStack(StackPtr + 1) 'Pop Last
- LOOP
-
- ERASE QStack 'delete the stack array
-
- END SUB
-
- As you can see, this version of the Quick Sort subprogram is derived from
- the one shown earlier. The important difference is that all of the
- incoming information is passed as segments, addresses, and bytes, rather
- than using an explicit array name. But before describing the inner details
- of the subprogram itself, I'll address the demonstration portion and show
- how the routine is set up and called.
- As with some of the other procedures on the disk that comes with this
- book, you will extract the TypeSort subprogram and add it to your own
- programs by loading it as a module, and then using the Move option of
- BASIC's View Subs menu. You can quickly access this menu by pressing F2,
- and then use Alt-M to select Move. Once this is done you will unload
- TYPESORT.BAS using the Alt-F-U menu selection, and answer *No* when asked
- if you want to save the modified file. You could also copy the TypeSort
- subprogram into a separate file, and then load that file as a module in
- each program that needs it.
- Although the example TYPE definition here shows only four components,
- you may of course use any TYPE structure. TypeSort expects six parameters
- to tell it where in memory the array is located, how far into each element
- the comparison routines are to begin, the total length of each element, the
- length of the key fields, and the number of elements to sort.
- After defining MyType, the setup portion of TYPESORT.BAS establishes the
- offset, element size, and key size parameters. As you can see, four
- different sample setups are provided, and you should add remarking
- apostrophes to all but one of them. If the program is left as is, the last
- setup values will take precedence.
- The next section reads sample names, addresses and dollar amounts from
- DATA statements, and formats the dollar amounts as described earlier. The
- dollar portion of the amounts are right justified into the Dollars field
- of each element, and the Cents portion is padded with trailing zeros as
- necessary to provide a dollars and cents format. This way, the value 12.3
- will be assigned as 12.30, and 123 will be formatted to 123.00 which gives
- the expected appearance.
- The final setup step is to determine where the array begins in memory.
- Since you specify the starting segment and address, it is simple to begin
- sorting at any array element. For example, to sort elements 100 through
- 200--even if the array is larger than that--you'd use VARSEG(Array(100))
- and VARPTR(Array(100) instead of element 1 as shown in this example.
- In addition to the starting segment and address of the array, TypeSort
- also requires you to tell it how many elements to consider. If you are
- sorting the entire array and the array starts with element 1, this will
- simply be the UBOUND of the array. If you are sorting just a portion of
- the array then you give it only the number of elements to be sorted. So
- to sort elements 100 through 200, the number of elements will be 101. A
- general formula you can use for calculating this based on element numbers
- is NumElements = LastElement - FirstElement + 1.
- Now let's consider the TypeSort subprogram itself. Since it is more
- like the earlier QSort program than different, I will cover only the
- differences here. In fact, the primary difference is in the way
- comparisons and exchanges are handled. The Compare3 function introduced
- earlier is used to compare the array elements with the midpoint. Although
- QSort made a temporary copy of the midpoint element, that would be
- difficult to do here. Since the routine is designed to work with any type
- of data--and the size of each element can vary depending on the TYPE
- structure--it is impractical to make a copy.
- While SPACE$ could be used to claim a block of memory into which the
- midpoint element is copied, there's a much better way: the Temp variable
- is used to remember the element number itself. The only complication is
- that once elements I and J are swapped, Temp must be reassigned if it was
- equal to either of them. (This happens just below the call to SwapMem.)
- But the simple integer IF test and assignment required adds far less code
- and is much faster than making a copy of the element.
- TypeSort is designed to sort the array in ascending order, and comments
- in the code show how to change it to sort descending instead. If you
- prefer to have one subprogram that can do both, you should add an extra
- parameter, perhaps called Direction. Near the beginning of the routine
- before the initial outer DO you would add this:
-
- IF Direction = 0 THEN 'sort ascending
- ICompare = -1
- JCompare = 1
- ELSE 'sort descending
- ICompare = 1
- JCompare = -1
- END IF
-
- Then, where the results from Compare3 are compared to -1 and 1 replace
- those comparisons (at the end of each WHILE line) to instead use ICompare
- and JCompare:
-
- WHILE Compare3%(...) = ICompare
- I = I + 1
- WEND
- WHILE Compare3%(...) = JCompare
- J = J - 1
- WEND
-
- This way, you are using variables to establish the sorting direction, and
- those variables can be set either way each time TypeSort is called.
- The last major difference is that elements are exchanged using the
- SwapMem routine rather than BASIC's SWAP statement. While it is possible
- to call SWAP by aliasing its name as shown in Chapter 5, it was frankly
- simpler to write a new routine for this purpose. Further, BASIC's SWAP is
- slower than SwapMem because it must be able to handle variables of
- different lengths, and also exchange fixed-length and conventional strings.
- SwapMem is extremely simple, and it works very quickly.
- As I stated earlier, the only way to write a truly generic sort routine
- is by passing segments and addresses and bytes, instead of array names.
- Although it would be great if BASIC could let you declare a subprogram or
- function using the AS ANY option to allow any type of data, that simply
- wouldn't work. As BASIC compiles your program, it needs to know the size
- and type of each parameter. When you reference TypeVar.LastName, BASIC
- knows where within TypeVar the LastName portion begins, and uses that in
- its address calculations. It is not possible to avoid this limitation
- other than by using addresses as is done here.
- Indeed, this is the stuff that C and assembly language programs are made
- of. In these languages--especially assembly language--integer pointer
- variables are used extensively to show where data is located and how long
- it is. However, the formulas used within the Compare3 and SwapMem function
- calls are not at all difficult to understand.
- The formula Address + Offset - (I - 1) * ElSize indicates where the key
- field of element I begins. Address holds the address of the beginning of
- the first element, and Offset is added to identify the start of the first
- sort key. (I - 1) is used instead of I because addresses are always zero-
- based. That is, the first element in the array from TypeSort's perspective
- is element 0, even though the calling program considers it to be element
- 1. Finally, the element number is multiplied times the length of each
- element, to determine the value that must be added to the starting address
- and offset to obtain the final address for the data in element I. Please
- understand that calculations such as these are what the compiler must do
- each time you access an array element.
- Note that if you call TypeSort incorrectly or give it illegal element
- numbers, you will not receive a "Subscript out of range" error from BASIC.
- Rather, you will surely crash your PC and have to reboot. This is the
- danger--and fun--of manipulating pointers directly.
- As I stated earlier, the SwapMem routine that does the actual exchanging
- of elements is very simple, and it merely takes a byte from one element and
- exchanges it with the corresponding byte in the other. This task is
- greatly simplified by the use of the XCHG assembly language command, which
- is similar to BASIC's SWAP statement. Although XCHG cannot swap a word in
- memory with another word in memory, it can exchange memory with a register.
- SwapMem is shown in the listing below.
-
- ;SWAPMEM.ASM, swaps two sections of memory
-
- .Model Medium, Basic
- .Code
-
- SwapMem Proc Uses SI DI DS ES, Var1:DWord, Var2:DWord, NumBytes:Word
-
- Lds SI,Var1 ;get the segmented address of the
- ; first variable
- Les DI,Var2 ;and for the second variable
- Mov CX,NumBytes ;get the number of bytes to exchange
- Jcxz Exit ;we can't swap zero bytes!
-
- DoSwap:
- Mov AL,ES:[DI] ;get a byte from the second variable
- Xchg AL,[SI] ;swap it with the first variable
- Stosb ;complete the swap and increment DI
- Inc SI ;point to the next source byte
- Loop DoSwap ;continue until done
-
- Exit:
- Ret ;return to BASIC
-
- SwapMem Endp
- End
-
- INDEXED SORTING ON MULTIPLE KEYS
-
- Earlier I showed how to modify the simple Bubble Sort routine to sort a
- parallel index array instead of the primary array. One important reason
- you might want to do that is to allow access to the primary array in both
- its original and sorted order. Another reason, and one we will get to
- shortly, is to facilitate sorting disk files. Although a routine to sort
- the records in a file could swap the actual data, it takes a long time to
- read and write that much data on disk. Further, each time you wanted to
- access the data sorted on a different key, the entire file would need to
- be sorted again.
- A much better solution is to create one or more sorted lists of record
- numbers, and store those on disk each in a separate file. This lets you
- access the data sorted by name, or by zip code, or by any other field,
- without ever changing the actual file. The TypeISort subprogram below is
- adapted from TypeSort, and it sorts an index array that holds the element
- numbers of a TYPE array.
-
- 'TYPISORT.BAS, indexed multi-key sort for TYPE arrays
-
- DEFINT A-Z
-
- DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
- BYVAL Adr2, NumBytes)
- DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
- BYVAL Adr2, BYVAL Length)
- DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
- NumEls, Index())
-
- CONST NumEls% = 23 'this keeps it all on the screen
-
- TYPE MyType
- LastName AS STRING * 10
- FirstName AS STRING * 10
- Dollars AS STRING * 6
- Cents AS STRING * 2
- END TYPE
- REDIM Array(1 TO NumEls%) AS MyType
- REDIM Index(1 TO NumEls%) 'create the index array
-
- Offset = 1 'start sorting with LastName
- ElSize = LEN(Array(1)) 'the length of each element
- KeySize = ElSize 'sort based on all 4 fields
-
- FOR X = 1 TO NumEls% 'build the array from DATA
- READ Array(X).LastName
- READ Array(X).FirstName
- READ Amount$
- ... 'this continues as already
- ... ' shown in TypeSort
- NEXT
-
- FOR X = 1 TO NumEls% 'initialize the index
- Index(X) = X - 1 'but starting with 0
- NEXT
-
- Segment = VARSEG(Array(1)) 'show where the array is
- Address = VARPTR(Array(1)) ' located in memory
- CALL TypeISort(Segment, Address, ElSize, Offset, KeySize, NumEls%, Index())
-
- CLS 'display the results
- FOR X = 1 TO NumEls% '+ 1 adjusts to one-based
- PRINT Array(Index(X) + 1).LastName,
- PRINT Array(Index(X) + 1).FirstName,
- PRINT Array(Index(X) + 1).Dollars; ".";
- PRINT Array(Index(X) + 1).Cents
- NEXT
-
- DATA Smith, John, 123.45 'this continues as already
- ... ' shown in TypeSort
- ...
-
- END
-
- SUB TypeISort (Segment, Address, ElSize, Displace, KeySize, NumEls, _
- Index()) STATIC
-
- REDIM QStack(NumEls \ 5 + 10) 'create a stack
-
- First = 1 'initialize working variables
- Last = NumEls
- Offset = Displace - 1 'make zero-based now for speed later
-
- DO
- DO
- Temp = (Last + First) \ 2 'seek midpoint
- I = First
- J = Last
-
- DO 'change -1 to 1 and 1 to -1 to sort descending
- WHILE Compare3%(Segment, Address + Offset + (Index(I) * ElSize), _
- Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = -1
- I = I + 1
- WEND
- WHILE Compare3%(Segment, Address + Offset + (Index(J) * ElSize), _
- Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = 1
- J = J - 1
- WEND
- IF I > J THEN EXIT DO
- IF I < J THEN
- SWAP Index(I), Index(J)
- IF Temp = I THEN
- Temp = J
- ELSEIF Temp = J THEN
- Temp = I
- END IF
- END IF
- I = I + 1
- J = J - 1
- LOOP WHILE I <= J
-
- IF I < Last THEN
- QStack(StackPtr) = I 'Push I
- QStack(StackPtr + 1) = Last 'Push Last
- StackPtr = StackPtr + 2
- END IF
-
- Last = J
- LOOP WHILE First < Last
-
- IF StackPtr = 0 THEN EXIT DO 'Done
- StackPtr = StackPtr - 2
- First = QStack(StackPtr) 'Pop First
- Last = QStack(StackPtr + 1) 'Pop Last
- LOOP
-
- ERASE QStack 'delete the stack array
-
- END SUB
-
- As with TypeSort, TypeISort is entirely pointer based so it can be used
- with any type of data and it can sort multiple contiguous keys. The only
- real difference is the addition of the Index() array parameter, and the
- extra level of indirection needed to access the index array each time a
- comparison is made. Also, when a swap is required, only the integer index
- elements are exchanged, which simplifies the code and reduces its size.
- Like TypeSort, you can change the sort direction by reversing the -1 and
- 1 values used with Compare3, or add a Direction parameter to the list and
- modify the code to use that.
- As with BubbleISort, the index array is initialized to increasing values
- by the calling program; however, here the first element is set to hold a
- value of 0 instead of 1. This reduces the calculations needed within the
- routine each time an address must be obtained. Therefore, when TypeISort
- returns, the caller must add 1 to the element number held in each index
- element. This is shown within the FOR/NEXT loop that displays the sorted
- results.
-
-
- SORTING FILES
-
- With the development of TypeISort complete, we can now use that routine
- to sort disk files. The sorting strategy will be to determine how many
- records are in the file, to determine how many separate passes are needed
- to process the entire file. TypeISort and TypeSort are restricted to
- working with arrays no larger than 64K (32K in the editing environment),
- so there is a limit as to how much data may be loaded into memory at one
- time. These sort routines can accommodate more data when compiled because
- address calculations that result in values larger than 32767 cause an
- overflow error in the QB editor. This overflow is in fact harmless, and
- is ignored in a compiled program unless you use the /d switch.
- Although the routines could be modified to perform segment and address
- arithmetic to accommodate larger arrays, that still wouldn't solve the
- problem of having more records than can fit in memory at once. Therefore,
- separate passes must be used to sort the file contents in sections, with
- each pass writing a temporary index file to disk. A final merge pass then
- reads each index to determine which pieces fits where, and then writes the
- final index file. The program FILESORT.BAS below incorporates all of the
- sorting techniques shown so far, and includes a few custom BASIC routines
- to improve its performance.
-
- 'FILESORT.BAS, indexed multi-key random access file sort
-
- DEFINT A-Z
-
- DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
- BYVAL Adr2, NumBytes)
- DECLARE FUNCTION Exist% (FileSpec$)
- DECLARE SUB DOSInt (Registers AS ANY)
- DECLARE SUB FileSort (FileName$, NDXName$, RecLength, Offset, KeySize)
- DECLARE SUB LoadFile (FileNum, Segment, Address, Bytes&)
- DECLARE SUB SaveFile (FileNum, Segment, Address, Bytes&)
- DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
- BYVAL Length)
- DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
- NumEls, Index())
-
- RANDOMIZE TIMER 'create new data each run
- DEF FnRand% = INT(RND * 10 + 1) 'returns RND from 1 to 10
-
- TYPE RegType 'used by DOSInt
- AX AS INTEGER
- BX AS INTEGER
- CX AS INTEGER
- DX AS INTEGER
- BP AS INTEGER
- SI AS INTEGER
- DI AS INTEGER
- FL AS INTEGER
- DS AS INTEGER
- ES AS INTEGER
- END TYPE
-
- DIM SHARED Registers AS RegType 'share among all subs
- REDIM LastNames$(1 TO 10) 'we'll select names at
- REDIM FirstNames$(1 TO 10) ' random from these
-
- NumRecords = 2988 'how many test records to use
- FileName$ = "TEST.DAT" 'really original, eh?
- NDXName$ = "TEST.NDX" 'this is the index file name
-
- TYPE RecType
- LastName AS STRING * 11
- FirstName AS STRING * 10
- Dollars AS STRING * 6
- Cents AS STRING * 2
- AnyNumber AS LONG 'this shows that only key
- OtherNum AS LONG ' information must be ASCII
- END TYPE
-
- FOR X = 1 TO 10 'read the possible last names
- READ LastNames$(X)
- NEXT
-
- FOR X = 1 TO 10 'and the possible first names
- READ FirstNames$(X)
- NEXT
-
- DIM RecordVar AS RecType 'to create the sample file
- RecLength = LEN(RecordVar) 'the length of each record
- CLS
- PRINT "Creating a test file..."
-
- IF Exist%(FileName$) THEN 'if there's an existing file
- KILL FileName$ 'kill the old data from prior
- END IF ' runs to start fresh
-
- IF Exist%(NDXName$) THEN 'same for any old index file
- KILL NDXName$
- END IF
-
-
- '---- Create some test data and write it to the file
- OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
- FOR X = 1 TO NumRecords
- RecordVar.LastName = LastNames$(FnRand%)
- RecordVar.FirstName = FirstNames$(FnRand%)
- Amount$ = STR$(RND * 10000)
- Dot = INSTR(Amount$, ".")
- IF Dot THEN
- RSET RecordVar.Dollars = LEFT$(Amount$, Dot - 1)
- RecordVar.Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
- ELSE
- RSET RecordVar.Dollars = Amount$
- RecordVar.Cents = "00"
- END IF
- RecordVar.AnyNumber = X
- PUT #1, , RecordVar
- NEXT
- CLOSE
-
- '----- Created a sorted index based on the main data file
- Offset = 1 'start sorting with LastName
- KeySize = 29 'sort based on first 4 fields
- PRINT "Sorting..."
- CALL FileSort(FileName$, NDXName$, RecLength, Offset, KeySize)
-
-
- '----- Display the results
- CLS
- VIEW PRINT 1 TO 24
- LOCATE 25, 1
- COLOR 15
- PRINT "Press any key to pause/resume";
- COLOR 7
- LOCATE 1, 1
-
- OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
- OPEN NDXName$ FOR BINARY AS #2
- FOR X = 1 TO NumRecords
- GET #2, , ThisRecord 'get next rec. number
- GET #1, ThisRecord, RecordVar 'then the actual data
-
- PRINT RecordVar.LastName; 'print each field
- PRINT RecordVar.FirstName;
- PRINT RecordVar.Dollars; ".";
- PRINT RecordVar.Cents
-
- IF LEN(INKEY$) THEN 'pause on a keypress
- WHILE LEN(INKEY$) = 0: WEND
- END IF
- NEXT
- CLOSE
-
- VIEW PRINT 1 TO 24 'restore the screen
- END
-
- DATA Smith, Cramer, Malin, Munro, Passarelli
- DATA Bly, Osborn, Pagliaro, Garcia, Winer
-
- DATA John, Phil, Paul, Anne, Jacki
- DATA Patricia, Ethan, Donald, Tami, Elli
- END
-
-
- FUNCTION Exist% (Spec$) STATIC 'reports if a file exists
-
- DIM DTA AS STRING * 44 'the work area for DOS
- DIM LocalSpec AS STRING * 60 'guarantee the spec is in
- LocalSpec$ = Spec$ + CHR$(0) ' DGROUP for BASIC PDS
-
- Exist% = -1 'assume true for now
-
- Registers.AX = &H1A00 'assign DTA service
- Registers.DX = VARPTR(DTA) 'show DOS where to place it
- Registers.DS = VARSEG(DTA)
- CALL DOSInt(Registers)
-
- Registers.AX = &H4E00 'find first matching file
- Registers.CX = 39 'any file attribute okay
- Registers.DX = VARPTR(LocalSpec)
- Registers.DS = VARSEG(LocalSpec)
- CALL DOSInt(Registers) 'see if there's a match
-
- IF Registers.FL AND 1 THEN 'if the Carry flag is set
- Exist% = 0 ' there were no matches
- END IF
-
- END FUNCTION
-
-
- SUB FileSort (FileName$, NDXName$, RecLength, Displace, KeySize) STATIC
-
- CONST BufSize% = 32767 'holds the data being sorted
- Offset = Displace - 1 'make zero-based for speed later
-
- '----- Open the main data file
- FileNum = FREEFILE
- OPEN FileName$ FOR BINARY AS #FileNum
-
- '----- Calculate the important values we'll need
- NumRecords = LOF(FileNum) \ RecLength
- RecsPerPass = BufSize% \ RecLength
- IF RecsPerPass > NumRecords THEN RecsPerPass = NumRecords
-
- NumPasses = (NumRecords \ RecsPerPass) - ((NumRecords MOD RecsPerPass) _
- <> 0)
- IF NumPasses = 1 THEN
- RecsLastPass = RecsPerPass
- ELSE
- RecsLastPass = NumRecords MOD RecsPerPass
- END IF
-
- '----- Create the buffer and index sorting arrays
- REDIM Buffer(1 TO 1) AS STRING * BufSize
- REDIM Index(1 TO RecsPerPass)
- IndexAdjust = 1
-
-
- '----- Process all of the records in manageable groups
- FOR X = 1 TO NumPasses
-
- IF X < NumPasses THEN 'if not the last pass
- RecsThisPass = RecsPerPass 'do the full complement
- ELSE 'the last pass may have
- RecsThisPass = RecsLastPass ' fewer records to do
- END IF
-
- FOR Y = 1 TO RecsThisPass 'initialize the index
- Index(Y) = Y - 1 'starting with value of 0
- NEXT
-
- '----- Load a portion of the main data file
- Segment = VARSEG(Buffer(1)) 'show where the buffer is
- CALL LoadFile(FileNum, Segment, Zero, RecsThisPass * CLNG(RecLength))
- CALL TypeISort(Segment, Zero, RecLength, Displace, KeySize, _
- RecsThisPass, Index())
-
- '----- Adjust the zero-based index to record numbers
- FOR Y = 1 TO RecsThisPass
- Index(Y) = Index(Y) + IndexAdjust
- NEXT
-
- '----- Save the index file for this pass
- TempNum = FREEFILE
- OPEN "$$PASS." + LTRIM$(STR$(X)) FOR OUTPUT AS #TempNum
- CALL SaveFile(TempNum, VARSEG(Index(1)), Zero, RecsThisPass * 2&)
- CLOSE #TempNum
-
- '----- The next group of record numbers start this much higher
- IndexAdjust = IndexAdjust + RecsThisPass
-
- NEXT
-
- ERASE Buffer, Index 'free up the memory
-
-
- '----- Do a final merge pass if necessary
- IF NumPasses > 1 THEN
-
- NDXNumber = FREEFILE
- OPEN NDXName$ FOR BINARY AS #NDXNumber
- REDIM FileNums(NumPasses) 'this holds the file numbers
- REDIM RecordNums(NumPasses) 'this holds record numbers
-
- REDIM MainRec$(1 TO NumPasses) 'holds main record data
- REDIM Remaining(1 TO NumPasses) 'tracks index files
-
- '----- Open the files and seed the first round of data
- FOR X = 1 TO NumPasses
- FileNums(X) = FREEFILE
- OPEN "$$PASS." + LTRIM$(STR$(X)) FOR BINARY AS #FileNums(X)
- Remaining(X) = LOF(FileNums(X)) 'this is what remains
- MainRec$(X) = SPACE$(RecLength) 'holds main data file
-
- GET #FileNums(X), , RecordNums(X) 'get the next record number
- RecOffset& = (RecordNums(X) - 1) * CLNG(RecLength) + 1
- GET #FileNum, RecOffset&, MainRec$(X) 'then get the data
- NEXT
-
- FOR X = 1 TO NumRecords
-
- Lowest = 1 'assume this is the lowest data in the group
- WHILE Remaining(Lowest) = 0 'Lowest can't refer to a dead index
- Lowest = Lowest + 1 'so seek to the next higher active index
- WEND
-
- FOR Y = 2 TO NumPasses 'now seek out the truly lowest element
- IF Remaining(Y) THEN 'consider only active indexes
- IF Compare3%(SSEG(MainRec$(Y)), _ '<-- use VARSEG with QB
- SADD(MainRec$(Y)) + Offset, _
- SSEG(MainRec$(Lowest)), _ '<-- use VARSEG with QB
- SADD(MainRec$(Lowest)) + Offset, KeySize) = -1 THEN
- Lowest = Y
- END IF
- END IF
- NEXT
-
- PUT #NDXNumber, , RecordNums(Lowest) 'write the main index
- Remaining(Lowest) = Remaining(Lowest) - 2
- IF Remaining(Lowest) THEN 'if the index is still active
- GET #FileNums(Lowest), , RecordNums(Lowest)
- RecOffset& = (RecordNums(Lowest) - 1) * CLNG(RecLength) + 1
- GET #FileNum, RecOffset&, MainRec$(Lowest)
- END IF
-
- NEXT
-
- ELSE
- '----- Only one pass was needed so simply rename the index file
- NAME "$$PASS.1" AS NDXName$
- END IF
-
- CLOSE 'close all open files
-
- IF Exist%("$$PASS.*") THEN 'ensure there's a file to kill
- KILL "$$PASS.*" 'kill the work files
- END IF
-
- ERASE FileNums, RecordNums 'erase the work arrays
- ERASE MainRec$, Remaining
-
- END SUB
-
-
- SUB LoadFile (FileNum, Segment, Address, Bytes&) STATIC
- IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
- Registers.AX = &H3F00 'read from file service
- Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
- Registers.CX = Bytes& 'how many bytes to load
- Registers.DX = Address 'and at what address
- Registers.DS = Segment 'and at what segment
- CALL DOSInt(Registers)
- END SUB
-
-
- SUB SaveFile (FileNum, Segment, Address, Bytes&) STATIC
- IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
- Registers.AX = &H4000 'write to file service
- Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
- Registers.CX = Bytes& 'how many bytes to load
- Registers.DX = Address 'and at what address
- Registers.DS = Segment 'and at what segment
- CALL DOSInt(Registers)
- END SUB
-
-
- SUB TypeISort (....) STATIC 'as shown in TYPISORT.BAS
-
- END SUB
-
- FILESORT.BAS begins by defining a function that returns a random number
- between 1 and 10. Although the earlier sort demonstrations simply read the
- test data from DATA statements, that is impractical when creating thousands
- of records. Instead, two arrays are filled--one with ten last names and
- another with ten first names--and these names are drawn from at random.
- The Registers TYPE variable that is defined is used by three of the
- supporting routines in this program. RegType is normally associated with
- CALL Interrupt and InterruptX, but I have written a small-code replacement
- to mimic InterruptX that works with DOS Interrupt &H21 only. DOSInt
- accepts just a single Registers argument, instead of the three parameters
- that BASIC's Interrupt and InterruptX require. Besides adding less code
- each time it is used, the routine itself is smaller and simpler than
- InterruptX.
- The remainder of the demonstration program should be easy to follow, so
- I won't belabor its operation; the real action is in the FileSort
- subprogram.
- Like TypeSort and TypeISort, FileSort is entirely pointer based, to
- accommodate TYPE elements of any size and structure. You provide the name
- of the main data file to be sorted, the name of an index file to create,
- and the length and offset of the keys within the disk records. The
- Displace parameter tells how far into the TYPE structure the key
- information is located. When calling TypeISort this value is should be
- one-based, but in the final merge pass where Compare3 is used, a zero-based
- number is required. Therefore, a copy is made (Offset = Displace - 1) near
- the beginning of the routine. This way, both are available quickly without
- having to calculate - 1 repeatedly slowing its operation.
- The initial steps FileSort performs are to determine how many records
- are in the data file, and from that how many records can fit into memory
- at one time. Once these are known, the number of passes necessary can be
- easily calculated. An extra step is needed to ensure that RecsPerPass is
- not greater than the number of records in the file. Just because 200
- records can fit into memory at once doesn't mean there are that many
- records. In most cases where multiple passes are needed the last pass will
- process fewer records than the others. If there are, say, 700 records and
- each pass can sort 300, the last pass will sort only 100 records.
- Once the pass information is determined, a block of memory is created
- to hold each portion of the file for sorting. This is the purpose of the
- Buffer array. REDIM is used to create a 32K chunk of memory that doesn't
- impinge on available string space.
- For each pass that is needed, the number of records in the current pass
- is determined and the index array is initialized to increasing values.
- Then, a portion of the main data file is read using the LoadFile
- subprogram. BASIC does not allow you to read records from a random access
- file directly into a buffer specified by its address. And even if it did,
- we can load data much faster than pure BASIC by reading a number of records
- all at once.
- Once the current block of records has been loaded, TypeISort is called
- to sort the index array. The index array is also saved very quickly using
- SaveFile, which is the compliment to LoadFile. A unique name is given to
- each temporary index file such that the first one is named $$PASS.1, the
- second $$PASS.2, and so forth. By using dollar signs in the name it is
- unlikely that the routine will overwrite an existing file from another
- application. Of course, you may change the names to anything else if you
- prefer.
- Notice the extra step that manipulates the IndexAdjust variable. This
- adjustment is needed because each sort pass returns the index array holding
- record numbers starting at 0. The first time through, 1 must be added to
- each element to reflect BASIC's use of record numbers that start at 1. If
- the first pass sorts, say, 250 records, then the index values 1 through 250
- are saved to disk. But the second pass is processing records 251 through
- 500, so an adjustment value of 251 must be added to each element prior to
- writing it to disk.
- If the data file is small and only one pass was needed, the $$PASS.1
- file is simply renamed to whatever the caller specified. Otherwise, a
- merge pass is needed to determine which record number is the next in
- sequence based on the results of each pass. Believe it or not, this is the
- trickiest portion of the entire program. For the sake of discussion, we'll
- assume that four passes were required to sort the file.
- Each of the four index files contains a sequence of record numbers, and
- all of the records within that sequence are in sorted order. However,
- there is no relationship between the data records identified in one index
- file and those in another. Thus, each index file and corresponding data
- record must be read in turn. A FOR/NEXT loop then compares each of the
- four records, to see which is truly next in the final sequence. The
- complication arises as the merge nears completion, because some of the
- indexes will have become exhausted. This possibility is handled by the
- Remaining array.
- Elements in the Remaining array are initialized to the length of each
- index file as the indexes are opened. Then, as each index entry is read
- from disk, the corresponding element is decremented by two to show that
- another record number was read. Therefore, the current Remaining element
- must be checked to see if that index has been exhausted. Otherwise, data
- that was already processed might be considered in the merge comparisons.
- The final steps are to close all the open files, delete the temporary
- index files, and erase the work arrays to free the memory they occupied.
- One important point to observe is the use of SSEG to show Compare3 where
- the MainRec$ elements are located. SSEG is for BASIC 7 only; if you are
- using QuickBASIC you must change SSEG to VARSEG. SSEG can be used with
- either near or far strings in BASIC 7, but VARSEG works with near strings
- only. SSEG is used as the default, so an error will be reported if you are
- using QuickBASIC. The cursor will then be placed near the comment in the
- program that shows the appropriate correction.
-
-
- SEARCHING FUNDAMENTALS
- ======================
-
- As with sorting, searching data effectively also requires that you select
- an appropriate algorithm. There are many ways to search data, and we will
- look at several methods here. The easiest to understand is a linear
- search, which simply examines each item in sequence until a match is found:
-
-
- FoundAt = 0 'assume no match
-
- FOR X = 1 TO NumElements 'search all elements
- IF Array$(X) = Sought$ THEN
- FoundAt = X 'remember where it is
- EXIT FOR 'no need to continue
- END IF
- NEXT
-
- IF FoundAt THEN 'if it was found
- PRINT "Found at element"; FoundAt
- ELSE
- PRINT "Not found" 'otherwise
- END IF
-
-
- For small arrays a linear search is effective and usually fast enough.
- Also, integer and long integer arrays can be searched reasonably quickly
- even if there are many elements. But with string data, as the number of
- elements that must be searched increases, the search time can quickly
- become unacceptable. This is particularly true when additional features
- are required such as searching without regard to capitalization or
- comparing only a portion of each element using MID$. Indeed, many of the
- same techniques that enhance a sort routine can also be employed when
- searching.
- To search ignoring capitalization you would first capitalize Sought$
- outside of the loop, and then use UCASE$ with each element in the
- comparisons. Using UCASE$(Sought$) repeatedly within the loop is both
- wasteful and unnecessary:
-
- Sought$ = UCASE$(Sought$)
- .
- .
- IF UCASE$(Array$(X)) = Sought$ THEN
-
- Likewise, comparing only a portion of each string will require MID$ with
- each comparison, after using MID$ initially to extract what is needed from
- Sought$:
-
- Sought$ = MID$(Sought$, 12, 6)
- .
- .
- IF MID$(Array$(X), 12, 6) = Sought$ THEN
-
- And again, as with sorting, these changes may be combined in a variety of
- ways. You could even use INSTR to see if the string being searched for
- is within the array, when an exact match is not needed:
-
- IF INSTR(UCASE$(Array$(X)), Sought$) THEN
-
- However, each additional BASIC function you use will make the searching
- slower and slower. Although BASIC's INSTR is very fast, adding UCASE$ to
- each comparison as shown above slows the overall process considerably.
- There are three primary ways that searching can be speeded up. One is
- to apply simple improvements based on understanding how BASIC works, and
- knowing which commands are fastest. The other is to select a better
- algorithm. The third is to translate selected portions of the search
- routine into assembly language. I will use all three of these techniques
- here, starting with enhancements to the linear search, and culminating with
- a very fast binary search for use with sorted data.
- One of the slowest operations that BASIC performs is comparing strings.
- For each string, its descriptor address must be loaded and passed to the
- comparison routine. That routine must then obtain the actual data address,
- and examine each byte in both strings until one of the characters is
- different, or it determines that both strings are the same. As I mentioned
- earlier, if one or both of the strings are fixed-length, then copies also
- must be made before the comparison can be performed.
- There is another service that the string comparison routine must
- perform, which is probably not obvious to most programmers and which also
- impacts its speed. BASIC frequently creates and then deletes temporary
- strings without your knowing it. One example is the copy it makes of
- fixed-length strings before comparing them. But there are other, more
- subtle situations in which this can happen.
- For example, when you use IF X$ + Y$ > Z$ BASIC must create a temporary
- string comprised of X$ + Y$, and then pass that to the comparison routine.
- Therefore, that routine is also responsible for determining if the incoming
- string is a temporary copy, and deleting it if so. In fact, all of BASIC's
- internal routines that accept string arguments are required to do this.
- Therefore, one good way to speed searching of conventional (not fixed-
- length) string arrays is to first compare the lengths. Since strings whose
- lengths are different can't possibly be the same, this will quickly weed
- those out. BASIC's LEN function is much faster than its string compare
- routine, and it offers a simple but effective opportunity to speed things
- up. LEN is made even faster because it requires only a single argument,
- as opposed to the two required for the comparison routine.
-
-
- SLen = LEN(Sought$) 'do this once outside the loop
- FOR X = 1 TO NumElements
- IF LEN(Array$(X)) = SLen THEN 'maybe...
- IF Array$(X) = Sought$ THEN 'found it!
- FoundAt = X
- EXIT FOR
- END IF
- END IF
- NEXT
-
-
- Similarly, if the first characters are not the same then the strings can't
- match either. Like LEN, BASIC's ASC is much faster than the full string
- comparison routine, and it too can improve search time by eliminating
- elements that can't possibly match. Depending on the type and distribution
- of the data in the array, using both LEN and ASCII can result in a very
- fast linear search:
-
-
- SLen = LEN(Sought$)
- SAsc = ASC(Sought$)
- FOR X = 1 TO NumElements
- IF LEN(Array$(X)) = SLen THEN
- IF ASC(Array$(X)) = SAsc THEN
- IF Array$(X) = Sought$ THEN
- ...
- END IF
- END IF
- END IF
- NEXT
-
-
- Notice that the LEN test must always be before the ASC test, to avoid an
- "Illegal function call" error if the array element is a null string. If
- all or most of the strings are the same length, then LEN will not be
- helpful, and ASC should be used alone.
- As I mentioned before, when comparing fixed-length string arrays BASIC
- makes a copy of each element into a conventional string, prior to calling
- its comparison routine. This copying is also performed when using ASC is
- used, but not LEN. After all, the length of a fixed-length never changes,
- and BASIC is smart enough to know the length directly. But then, comparing
- the lengths of these string is pointless anyway.
- Because of the added overhead to make these copies, the performance of
- a conventional linear search for fixed-length data is generally quite poor.
- This is a shame, because fixed-length strings are often the only choice
- when as much data as possible must be kept in memory at once. And fixed-
- length strings lend themselves perfectly to names and addresses. It should
- be apparent by now that the best solution for quickly comparing fixed-
- length string arrays--and the string portion of TYPE arrays too--is with
- the various Compare functions already shown.
- If you are searching for an exact match, then either Compare or Compare2
- will be ideal, depending on whether you want to ignore capitalization. If
- you have only a single string element in each array, you should define a
- dummy TYPE. This avoids the overhead of having to use both VARSEG and
- VARPTR as separate arguments. The short example program and SearchType
- functions that follow search a fixed-length string array for a match.
-
- DEFINT A-Z
- DECLARE FUNCTION Compare% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
- DECLARE FUNCTION Compare2% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
- DECLARE FUNCTION SearchType% (Array() AS ANY, Sought AS ANY)
- DECLARE FUNCTION SearchType2% (Array() AS ANY, Sought AS ANY)
- DECLARE FUNCTION SearchType3% (Array() AS ANY, Searched AS ANY)
-
- CLS
- TYPE FLen 'this lets us use SEG
- LastName AS STRING * 15
- END TYPE
-
- REDIM Array(1 TO 4000) AS FLen '4000 is a lot of names
- DIM Search AS FLen 'best comparing like data
-
- FOR X = 1 TO 4000 STEP 2 'impart some realism
- Array(X).LastName = "Henderson"
- NEXT
-
- Array(4000).LastName = "Henson" 'almost at the end
- Search.LastName = "Henson" 'find the same name
-
- '----- first time how long it takes using Compare
- Start! = TIMER 'start timing
-
- FOR X = 1 TO 5 'search five times
- FoundAt = SearchType%(Array(), Search)
- NEXT
-
- IF FoundAt >= 0 THEN
- PRINT "Found at element"; FoundAt
- ELSE
- PRINT "Not found"
- END IF
-
- Done! = TIMER
- PRINT USING "##.## seconds with Compare"; Done! - Start!
- PRINT
-
-
- '----- then time how long it takes using Compare2
- Start! = TIMER 'start timing
-
- FOR X = 1 TO 5 'as above
- FoundAt = SearchType2%(Array(), Search)
- NEXT
-
- IF FoundAt >= 0 THEN
- PRINT "Found at element"; FoundAt
- ELSE
- PRINT "Not found"
- END IF
-
- Done! = TIMER
- PRINT USING "##.## seconds with Compare2"; Done! - Start!
- PRINT
-
-
- '---- finally, time how long it takes using pure BASIC
- Start! = TIMER
-
- FOR X = 1 TO 5
- FoundAt = SearchType3%(Array(), Search)
- NEXT
-
- IF FoundAt >= 0 THEN
- PRINT "Found at element"; FoundAt
- ELSE
- PRINT "Not found"
- END IF
-
- Done! = TIMER
- PRINT USING "##.## seconds using BASIC"; Done! - Start!
- END
-
- FUNCTION SearchType% (Array() AS FLen, Sought AS FLen) STATIC
-
- SearchType% = -1 'assume not found
-
- FOR X = LBOUND(Array) TO UBOUND(Array)
- IF Compare%(Array(X), Sought, LEN(Sought)) THEN
- SearchType% = X 'save where it was found
- EXIT FOR 'and skip what remains
- END IF
- NEXT
-
- END FUNCTION
-
-
- FUNCTION SearchType2% (Array() AS FLen, Sought AS FLen) STATIC
-
- SearchType2% = -1 'assume not found
-
- FOR X = LBOUND(Array) TO UBOUND(Array)
- IF Compare2%(Array(X), Sought, LEN(Sought)) THEN
- SearchType2% = X 'save where it was found
- EXIT FOR 'and skip what remains
- END IF
- NEXT
-
- END FUNCTION
-
-
- FUNCTION SearchType3% (Array() AS FLen, Searched AS FLen) STATIC
-
- SearchType3% = -1 'assume not found
-
- FOR X = LBOUND(Array) TO UBOUND(Array)
- IF Array(X).LastName = Searched.LastName THEN
- SearchType3% = X 'save where it was found
- EXIT FOR 'and skip what remains
- END IF
- NEXT
-
- END FUNCTION
-
- When you run this program it will be apparent that the SearchType function
- is the fastest, because it uses Compare which doesn't perform any case
- conversions. SearchType2 is only slightly slower with that added overhead,
- and the purely BASIC function, SearchType3, lags far behind at half the
- speed. Note that the array is searched five times in succession, to
- minimize the slight errors TIMER imposes. Longer timings are generally
- more accurate than short ones, because of the 1/18th second resolution of
- the PC's system timer.
-
-
- BINARY SEARCHES
-
- This is about as far as we can go using linear searching, and to achieve
- higher performance requires a better algorithm. The Binary Search is one
- of the fastest available; however, it requires the data to already be in
- sorted order. A Binary Search can also be used with a sorted index, and
- both methods will be described.
- Binary searches are very fast, and also very simple to understand.
- Unlike the Quick Sort algorithm which achieves great efficiency at the
- expense of being complicated, a Binary Search can be written using only a
- few lines of code. The strategy is to start the search at the middle of
- the array. If the value of that element value is less than that of the
- data being sought, a new halfway point is checked and the process repeated.
- This way, the routine can quickly zero in on the value being searched for.
- Figure 8-3 below shows how this works.
-
-
- 13: Zambia
- 12: Sweden
- 11: Peru
- 10: Mexico <-- step 2
- 9: Holland
- 8: Germany
- 7: Finland <-- step 1
- 6: England
- 5: Denmark
- 4: China
- 3: Canada
- 2: Austria
- 1: Australia
-
- Figure 8-3: How a Binary Search locates data in a sorted array.
-
-
- If you are searching for Mexico, the first element examined is number 7,
- which is halfway through the array. Comparing Mexico to Finland shows
- that Mexico is greater, so the distance is again cut in half. In this
- case, a match was found after only two tries--remarkably faster than a
- linear search that would have required ten comparisons. Even when huge
- arrays must be searched, data can often be found in a dozen or so tries.
- One interesting property of a binary search is that it takes no longer to
- find the last element in the array than the first one.
- The program below shows one way to implement a Binary Search.
-
- DEFINT A-Z
- DECLARE FUNCTION BinarySearch% (Array$(), Find$)
-
- CLS
- PRINT "Creating test data..."
-
- REDIM Array$(1 TO 1000) 'create a "sorted" array
- FOR X = 1 TO 1000
- Array$(X) = "String " + RIGHT$("000" + LTRIM$(STR$(X)), 4)
- NEXT
-
- PRINT "Searching array..."
-
- FoundAt = BinarySearch%(Array$(), "String 0987")
- IF FoundAt >= 0 THEN
- PRINT "Found at element"; FoundAt
- ELSE
- PRINT "Not found"
- END IF
-
- END
-
-
- FUNCTION BinarySearch% (Array$(), Find$) STATIC
-
- BinarySearch% = -1 'no matching element yet
- Min = LBOUND(Array$) 'start at first element
- Max = UBOUND(Array$) 'consider through last
-
- DO
- Try = (Max + Min) \ 2 'start testing in middle
-
- IF Array$(Try) = Find$ THEN 'found it!
- BinarySearch% = Try 'return matching element
- EXIT DO 'all done
- END IF
-
- IF Array$(Try) > Find$ THEN 'too high, cut in half
- Max = Try - 1
- ELSE
- Min = Try + 1 'too low, cut other way
- END IF
- LOOP WHILE Max >= Min
-
- END FUNCTION
-
- The BinarySearch function returns either the element number where a match
- was found, or -1 if the search string was not found. Not using a value of
- zero to indicate failure lets you use arrays that start with element number
- 0. As you can see, the simplicity of this algorithm belies its incredible
- efficiency. The only real problem is that the data must already be in
- sorted order. Also notice that two string comparisons must be made--one
- to see if the strings are equal, and another to see if the current element
- is too high. Although you could use Compare3 which examines the strings
- once and tells if the data is the same or which is greater, a Binary Search
- is so fast that this probably isn't worth the added trouble. As you will
- see when you run the test program, it takes far longer to create the data
- than to search it!
- Besides the usual enhancements that can be applied to the comparisons
- using UCASE$ or MID$, this function could also be structured to use a
- parallel index array. Assuming the data is not sorted but the index array
- is, the modified Binary Search would look like this:
-
- FUNCTION BinaryISearch% (Array$(), Index(), Find$) STATIC
-
- BinaryISearch% = -1 'assume not found
- Min = LBOUND(Array$) 'start at first element
- Max = UBOUND(Array$) 'consider through last
-
- DO
- Try = (Max + Min) \ 2 'start testing in middle
-
- IF Array$(Index(Try)) = Find$ THEN 'found it!
- BinaryISearch% = Try 'return matching element
- EXIT DO 'all done
- END IF
-
- IF Array$(Index(Try)) > Find$ THEN 'too high, cut
- Max = Try - 1
- ELSE
- Min = Try + 1 'too low, cut other way
- END IF
- LOOP WHILE Max >= Min
-
- END FUNCTION
-
- NUMERIC ARRAYS
-
- All of the searching techniques considered so far have addressed string
- data. In most cases, string array searches are the ones that will benefit
- the most from improved techniques. As you have already seen, BASIC makes
- copies of fixed-length strings before comparing them, which slows down
- searching. And the very nature of strings implies that many bytes may have
- to be compared before determining if they are equal or which string is
- greater. In most cases, searching a numeric array is fast enough without
- requiring any added effort, especially when the data is integer or long
- integer.
- However, a few aspects of numeric searching are worth mentioning here.
- One is avoiding the inevitable rounding errors that are sure to creep into
- the numbers you are examining. Another is that in many cases, you may not
- be looking for an exact match. For example, you may need to find the first
- element that is higher than a given value, or perhaps determine the
- smallest value in an array.
- Unlike strings that are either the same or they aren't, the binary
- representation of numeric values is not always so precise. Consider the
- following test which *should* result in a match, but doesn't.
-
-
- Value! = 1!
- Result! = 2!
- CLS
-
- FOR X = 1 TO 1000
- Value! = Value! + .001
- NEXT
-
- IF Value! = Result! THEN
- PRINT "They are equal"
- ELSE
- PRINT "Value! ="; Value!
- PRINT "Result! ="; Result!
- END IF
-
-
- After adding .001 to Value! 1000 times Value! should be equal to 2, but
- instead it is slightly higher. This is because the binary storage method
- used by computers simply cannot represent every possible value with
- absolute accuracy. Even changing all of the single precision exclamation
- points (!) to double precision pound signs (#) will not solve the problem.
- Therefore, to find a given value in a numeric array can require some extra
- trickery.
- What is really needed is to determine if the numbers are *very close* to
- each other, as opposed to exactly the same. One way to accomplish this is
- to subtract the two, and see if the result is very close to zero. This is
- shown below.
-
-
- Value! = 1!
- Result! = 2!
- CLS
-
- FOR X = 1 TO 1000
- Value! = Value! + .001
- NEXT
-
- IF ABS(Value! - Result!) < .0001 THEN
- PRINT "They are equal"
- ELSE
- PRINT "Value! ="; Value!
- PRINT "Result! ="; Result!
- END IF
-
-
- Here, the absolute value of the difference between the numbers is examined,
- and if that difference is very small the numbers are assumed to be the
- same. Unfortunately, the added overhead of subtracting before comparing
- slows the comparison even further. There is no simple cure for this, and
- an array search must apply this subtraction to each element that is
- examined.
- Another common use for numeric array searches is when determining the
- largest or smallest value. Many programmers make the common mistake shown
- below when trying to find the largest value in an array.
-
-
- MaxValue# = 0
-
- FOR X = 1 TO NumElements
- IF Array#(X) > MaxValue# THEN
- MaxValue# = Array#(X)
- Element = X
- END IF
- NEXT
-
- PRINT "The largest value found is"; MaxValue#
- PRINT "And it was found at element"; Element
-
-
- The problem with this routine is that it doesn't account for arrays where
- all of the elements are negative numbers! In that case no element will be
- greater than the initial MaxValue#, and the routine will incorrectly report
- zero as the result. The correct method is to obtain the lowest element
- value, and use that as a starting point:
-
-
- MaxValue# = Array#(1)
-
- FOR X = 2 TO NumElements
- IF Array#(X) > MaxValue# THEN
- MaxValue# = Array#(X)
- END IF
- NEXT
-
- PRINT "The largest value found is"; MaxValue#
-
-
- Determining the highest value in an array would be handled similarly,
- except the greater-than symbol (>) would be replaced with a less-than
- operator (<).
-
-
- SOUNDEX
-
- The final searching technique I will show is Soundex. It is often useful
- to search for data based on its sound, for example when you do not know how
- to spell a person's name. Soundex was invented in the 1920's and has been
- used since then by, among others, the U.S. Census Bureau. A Soundex code
- is an alpha-numeric representation of the sound of a word, and it is
- surprisingly accurate despite its simplicity. The classic implementation
- of Soundex returns a four-character result code. The first character is
- the same as the first letter of the word, and the other three are numeric
- digits coded as shown in Figure 8-4.
-
-
- 1 B, F, P, V
- 2 C, G, J, K, Q, S, X
- 3 D, T
- 4 L
- 5 M, N
- 6 R
-
- Figure 8-4: The Soundex code numbers returned for significant letters of
- the alphabet.
-
-
- Letters not shown are simply skipped as being statistically insignificant
- to the sound of the word. In particular, speaking accents often minimize
- the importance of vowels, and blur their distinction. If the string is
- short and there are fewer than four digits, the result is simply padded
- with trailing zeros. One additional rule is that a code digit is never
- repeated, unless there is an uncoded letter in between. In the listing
- that follows, two different implementations of Soundex are shown.
-
- 'SOUNDEX.BAS, Soundex routines and example
-
- DEFINT A-Z
-
- DECLARE FUNCTION ASoundex$ (Word$)
- DECLARE FUNCTION ISoundex% (Word$)
-
- CLS
- DO
- PRINT "press Enter alone to exit"
- INPUT "What is the first word"; FWord$
- IF LEN(FWord$) = 0 THEN EXIT DO
- INPUT "What is the second word"; SWord$
- PRINT
-
- 'Test by alpha-numeric soundex
- PRINT "Alpha-Numeric Soundex: "; FWord$; " and ";
- PRINT SWord$; " do ";
- IF ASoundex$(FWord$) <> ASoundex$(SWord$) THEN
- PRINT "NOT ";
- END IF
- PRINT "sound the same."
- PRINT
-
- 'Test by numeric soundex
- PRINT " Numeric Soundex: "; FWord$; " and ";
- PRINT SWord$; " do ";
- IF ISoundex%(FWord$) <> ISoundex%(SWord$) THEN
- PRINT "NOT ";
- END IF
- PRINT "sound the same."
- PRINT
- LOOP
- END
-
-
- FUNCTION ASoundex$ (InWord$) STATIC
-
- Word$ = UCASE$(InWord$)
- Work$ = LEFT$(Word$, 1) + "000"
- WkPos = 2
- PrevCode = 0
-
- FOR L = 2 TO LEN(Word$)
- Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
- IF Temp THEN
- Temp = ASC(MID$("111122222222334556", Temp, 1))
- IF Temp <> PrevCode THEN
- MID$(Work$, WkPos) = CHR$(Temp)
- PrevCode = Temp
- WkPos = WkPos + 1
- IF WkPos > 4 THEN EXIT FOR
- END IF
- ELSE
- PrevCode = 0
- END IF
- NEXT
-
- ASoundex$ = Work$
-
- END FUNCTION
-
-
- FUNCTION ISoundex% (InWord$) STATIC
-
- Word$ = UCASE$(InWord$)
- Work$ = "0000"
- WkPos = 1
- PrevCode = 0
-
- FOR L = 1 TO LEN(Word$)
- Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
- IF Temp THEN
- Temp = ASC(MID$("111122222222334556", Temp, 1))
- IF Temp <> PrevCode THEN
- MID$(Work$, WkPos) = CHR$(Temp)
- PrevCode = Temp
- WkPos = WkPos + 1
- IF WkPos > 4 THEN EXIT FOR
- END IF
- ELSE
- PrevCode = 0
- END IF
- NEXT
-
- ISoundex% = VAL(Work$)
-
- END FUNCTION
-
-
- The first function, ASoundex, follows the standard Soundex definition and
- returns the result as a string. The ISoundex version cheats slightly by
- coding the first letter as a number, but it returns an integer value
- instead of a string. Because integer searches are many times faster than
- string searches, this version will be better when thousands--or even
- hundreds of thousands--of names must be examined.
- An additional benefit of the integer-only method is that it allows for
- variations on the first letter. For example, if you enter Cane and Kane
- in response to the prompts from SOUNDEX.BAS ASoundex will not recognize the
- names as sounding alike where ISoundex will.
-
-
- LINKED DATA
- ===========
-
- No discussion of searching and sorting would be complete without a mention
- of linked lists and other data links. Unlike arrays where all of the
- elements lie in adjacent memory locations, linked data is useful when data
- locations may be disjointed. One example is the linked list used by the
- DOS File Allocation Table (FAT) on every disk. As I described in Chapter
- 6, the data in each file may be scattered throughout the disk, and only
- through a linked list can DOS follow the thread from one sector in a file
- to another.
- Another example where linked data is useful--and the one we will focus
- on here--is to keep track of memo fields in a database. A memo field is
- a field that can store freeform text such as notes about a sales contact
- or a patient's medical history. Since these fields typically require
- varying lengths, it is inefficient to reserve space for the longest one
- possible in the main database file. Therefore, most programs store memo
- fields in a separate disk file, and use a *pointer field* in the main data
- file to show where the corresponding memo starts in the dedicated memo
- file. Similarly, a back pointer adjacent to each memo identifies the
- record that points to it. This is shown in Figure 8-5 below.
-
-
- ┌────────┬─────────┬─╥────────┬─────────┬──╥─────────
- MAIN.DAT │LastName│FirstName│1║LastName│FirstName│73║LastNa...
- └────────┴─────────┴─╨────────┴─────────┴──╨─────────
- ^ ^
- | |
- pointers into memo file -----+--------------------+
- (forward pointers)
-
-
-
- offsets into --+----------+-------+--------------+
- this memo file | | | |
- 1 73 126 233
- ┌──────────┬───────┬──────────────┬─────────┐
- MEMO.DAT │1LMemo1 │2LMemo2│3LMemo3 │4LMemo4 │
- └──────────┴───────┴──────────────┴─────────┘
- ^ ^ ^ ^
- | | | |
- record numbers +----------+-------+--------------+
- (back pointers)
-
- (L = length of this memo)
-
- Figure 8-5: Pointers relate record numbers to memo file offsets and vice
- versa.
-
-
- Here, the pointer in the main data file record is a long integer that holds
- the byte offset into the memo file where the corresponding memo text
- begins. And just before the memo text is an integer record number that
- shows which record this memo belongs to. (If you anticipate more than
- 65,535 records a long integer must be used instead.) Thus, these pointers
- provide links between the two files, and relate the information they
- contain.
- When a new record is added to the main file, the memo that goes with it
- is appended to the end of the memo file. BASIC's LOF function can be used
- to determine the current end of the memo file, which is then used as the
- beginning offset for the new memo text. And as the new memo is appended
- to MEMO.DAT, the first data actually written is the number of the new
- record in the main data file.
- The record number back pointer in the memo file is needed to allow memo
- data to be edited. Since there's no reasonable way to extend memo text
- when other memo data follows it, most programs simply abandon the old text,
- and allocate new space at the end of the file. The abandoned text is then
- marked as such, perhaps by storing a negative value as the record number.
- Storing a negative version of the abandoned data's length is ideal, because
- that both identifies the data as obsolete, and also tells how much farther
- into the file the next memo is located.
- The idea here is that you would periodically run a memo file maintenance
- program that compacts the file, thus eliminating the wasted space the
- abandoned memos occupy. This is similar to the DBPACK.BAS utility shown
- in Chapter 7, and also similar to the way that BASIC compacts string memory
- when it becomes full. But when an existing memo is relocated in the memo
- file, the field in the main data file that points to the memo must also be
- updated. And that's why the record number back pointer is needed: so the
- compaction program can know which record in the main file must be updated.
- The "L" identifier in the memo file in Figure 8-5, shown between the
- record number and memo text, is a length byte or word that tells how long
- the text is. If you plan to limit the memo field lengths to 255 or fewer
- characters, then a single byte is sufficient. Otherwise an integer must
- be used. An example of code that reads a data record and then its
- associated memo text is shown below.
-
-
- GET #MainFile, RecNumber, TypeVar
- MemoOffset& = TypeVar.MemoOff
- GET #MemoFile, MemoOffset& + 2, MemoLength%
- Memo$ = SPACE$(MemoLength%)
- GET #MemoFile, , Memo$
-
-
- The first step reads a record from the main data file into a TYPE variable,
- and the second determines where in the memo file the memo text begins. Two
- is added to that offset in the second GET statement, to skip over the
- record number back pointer which isn't needed here. Once the length of the
- memo text is known, a string is assigned to that length, and the actual
- text is read into it.
- If you are using long integer record numbers you would of course use
- MemoOffset& + 4 in the second GET. And if you're using a single byte to
- hold the memo length you would define a fixed-length string to receive
- that byte:
-
-
- DIM Temp AS STRING *1
- GET #MemoFile, MemoOffset& + 2, Temp
- MemoLength = ASC(Temp)
-
-
- Since BASIC doesn't offer a byte-sized integer data type, ASC and STR$ can
- be used to convert between numeric and string formats.
-
-
- ARRAY ELEMENT INSERTION AND DELETION
- ====================================
-
- The last issue related to array and memory manipulation I want to cover
- is inserting and deleting elements. If you intend to maintain file indexes
- or other information in memory and in sorted order, you will need some way
- to insert a new entry. By the same token, deleting an entry in a database
- requires that the parallel index entry also be deleted.
- The most obvious way to insert or delete elements in an array is with
- a FOR/NEXT loop. The first example below inserts an element, and the
- second deletes one.
-
-
- '----- Insert an element:
- Element = 200
- InsertValue = 999
-
- FOR X = UBOUND(Array) TO Element + 1 STEP -1
- Array(X) = Array(X - 1)
- NEXT
- Array(Element) = InsertValue
-
-
- '----- Delete an element:
- Element = 200
- FOR X = Element TO UBOUND(Array) - 1
- Array(X) = Array(X + 1)
- NEXT
- Array(UBOUND(Array)) = 0 'optionally clear last element
-
-
- For integer, long integer, and fixed-length arrays this is about as
- efficient as you can get, short of rewriting the code in assembly language.
- However, with floating point and string arrays the performance is less than
- ideal. Unless a numeric coprocessor is installed, floating point values
- are assigned using interrupts and support code in the emulator library.
- This adds an unnecessary level of complication that also impacts the speed.
- When strings are assigned the situation is even worse, because of the
- memory allocation overhead associated with dynamic string management.
- A better solution for floating point and string arrays is a series of
- SWAP statements. The short program below benchmarks the speed difference
- of the two methods, as it inserts an element into a single precision array.
-
- REDIM Array(1 TO 500)
- CLS
- Element% = 200
- InsertValue = 999
-
- Start = TIMER
- FOR A% = 1 TO 500
- FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
- Array(X%) = Array(X% - 1)
- NEXT
- Array(Element%) = InsertValue
- NEXT
- Done = TIMER
- PRINT USING "##.## seconds when assigning"; Done - Start
-
- Start = TIMER
- FOR A% = 1 TO 500
- FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
- SWAP Array(X%), Array(X% - 1)
- NEXT
- Array(Element%) = InsertValue
- NEXT
- Done = TIMER
- PRINT USING "##.## seconds when swapping"; Done - Start
-
- If you run this program in the BASIC environment, the differences may not
- appear that significant. But when the program is compiled to an executable
- file, the swapping method is more than four times faster. In fact, you
- should never compare programming methods using the BASIC editor for exactly
- this reason. In many cases, the slowness of the interpreting process
- overshadows significant differences between one approach and another.
- String arrays also benefit greatly from using SWAP instead of
- assignments, though the amount of benefit varies depending on the length
- of the strings. If you modify the previous program to use a string array,
- also add this loop to initialize the elements:
-
- FOR X% = 1 TO 500
- Array$(X%) = "String number" + STR$(X)
- NEXT
-
- With BASIC PDS far strings the difference is only slightly less at about
- three to one, due to the added complexity of far data. Also, SWAP will
- always be worse than assignments when inserting or deleting elements in a
- fixed-length string or TYPE array. An assignment merely copies the data
- from one location to another. SWAP, however, must copy the data in both
- directions.
- Understand that when using SWAP with conventional string arrays, the
- data itself is not exchanged. Rather, the four-byte string descriptors are
- copied. But because BASIC program modules store string data in different
- segments, extra work is necessary to determine which descriptor goes with
- which segment. When near strings are being used, only six bytes are
- exchanged, regardless of the length of the strings. Four bytes hold the
- descriptors, and two more store the back pointers.
-
-
- SUMMARY
- =======
-
- This chapter explained many of the finer points of sorting and searching
- all types of data in BASIC. It began with sorting concepts using the
- simple Bubble Sort as a model, and then went on to explain indexed and
- multi-key sorts. One way to implement a multi-key sort is by aligning the
- key fields into adjacent TYPE components. While there are some
- restrictions to this method, it is fairly simple to implement and also
- very fast.
- The Quick Sort algorithm was shown, and the SEEQSORT.BAS program on the
- accompanying disk helps you to understand this complex routine by
- displaying graphically the progress of the comparisons and exchanges as
- they are performed. Along the way you saw how a few simple modifications
- to any string sort routine can be used to sort regardless of
- capitalization, or based on only a portion of a string element.
- You also learned that writing a truly general sort routine that can
- handle any type of data requires dealing exclusively with segment and
- address pointers. Here, assembly language routines are invaluable for
- assisting you when performing the necessary comparisons and data exchanges.
- Although the actual operation of the assembly language routines will be
- deferred until Chapter 12, such routines may easily be added to a BASIC
- program using .LIB and .QLB libraries.
- I mentioned briefly the usefulness of packing and aligning data when
- possible, as an aid to fast sorting. In particular, dates can be packed
- to only three bytes in Year/Month/Day order, and other data such as zip
- codes can be stored in long integers. Because numbers can be compared much
- faster than strings, this helps the sorting routines operate more quickly.
- Array searching was also discussed in depth, and both linear and binary
- search algorithms were shown. As with the sorting routines, searching can
- also employ UCASE$ and MID$ to search regardless of capitalization, or on
- only a portion of each array element. Two versions of the Soundex
- algorithm were given, to let you easily locate names and other data based
- on how they sound.
- Besides showing the more traditional searching methods, I presented
- routines to determine the minimum and maximum values in a numeric array.
- I also discussed some of the ramifications involved when searching floating
- point data, to avoid the inevitable rounding errors that might cause a
- legitimate match to be ignored.
- Finally, some simple ways to insert and delete elements in both string
- and numeric arrays were shown. Although making direct assignments in a
- loop is the most obvious way to do this, BASIC's often-overlooked SWAP
- command can provide a significant improvement in speed.
- The next chapter will conclude this section about hands-on programming
- by showing a variety of program optimization techniques.